Mainnet /network/status intermittently doesn't respond

Has anyone faced an issue where POST /network/status with "Mainnet" intermittently doesn't respond? In my case it never comes back and I have to time out the request eventually.
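For reference, the request is equivalent to something like this (a minimal Go sketch; the port and timeout values are just placeholders for my setup):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Standard Rosetta NetworkRequest body for the Mainnet network.
	body := []byte(`{"network_identifier":{"blockchain":"Bitcoin","network":"Mainnet"}}`)

	// Client-side timeout so a hung request eventually gives up.
	client := &http.Client{Timeout: 30 * time.Second}

	// Port 8080 is a placeholder; adjust for your deployment.
	resp, err := client.Post("http://localhost:8080/network/status", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed (or timed out):", err)
		return
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```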

I've never seen this happen on the Testnet3 network, but it happens frequently on Bitcoin Mainnet.
Any suggestions?

I have yet to see this on our own instances, @kashif. Could you share some details about your setup (e.g. instance size, RAM, disk type, etc.)? Is this happening on other endpoints or just /network/status? Is this happening when the node is at tip, or while rosetta-bitcoin is still catching up to tip?

My hunch is that the load on bitcoind while it syncs blocks to tip, or the load rosetta-bitcoin puts on bitcoind while populating indexer storage, is degrading bitcoind's RPC performance. I would guess the problem goes away once you reach tip (even on the machine you have now), or sooner if you size up the machine's CPU and RAM. Once you answer the questions above, I can give some more specific advice.

Hi Patrick

  • It's happening while the node is still catching up to tip (will report back once synced).
  • This only happens on /network/status (other APIs like /block hit the indexer instead, so maybe that's why they don't see the issue).
  • I'm using an 8 vCPU / 14 GB instance for this, plus a 350 GB hard disk.
    • I understand that 16 GB is recommended, but I didn't see the node use more than 14 GB, so memory shouldn't be the root cause; I'll try increasing it anyway.
    • Also, do you think using an SSD might speed this up?

Also, I tried calling the JSON-RPC APIs manually and they are indeed taking a long time to respond. Could this be because the rosetta indexer is using up all the RPC resources? If so, it would be nice if the indexer left some RPC threads free so the APIs are not jammed.
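For example, this is roughly what I mean by calling the RPC manually and timing it (a Go sketch; the host, port, and credentials are placeholders for my setup):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A plain getblockchaininfo call, timed, to see how long bitcoind's RPC
	// takes to answer while the indexer is syncing.
	payload := []byte(`{"jsonrpc":"1.0","id":"probe","method":"getblockchaininfo","params":[]}`)

	req, err := http.NewRequest("POST", "http://127.0.0.1:8332/", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("rpcuser", "rpcpassword") // placeholder credentials
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{Timeout: 60 * time.Second}

	start := time.Now()
	resp, err := client.Do(req)
	elapsed := time.Since(start)
	if err != nil {
		fmt.Println("RPC failed after", elapsed, ":", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("RPC answered in", elapsed, "with status", resp.Status)
}
```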

Thanks for the help.

Yeah, this is more or less expected during this phase with the instance size you are using. When catching up to tip, the indexer uses basically all available CPU. What timeout are you using?

If you size this up to 16 vCPUs, I think you'll have much better luck. We probably need to make the indexer sync concurrency configurable. Right now, it will attempt to ramp up to ~64 concurrent block fetches plus quite a few concurrent compression threads (in some cases > 256).
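To illustrate what a cap would do, the fetch concurrency boils down to something like a counting semaphore around block fetches (a rough sketch only, not the actual indexer code):

```go
package main

import (
	"fmt"
	"sync"
)

// fetchBlock is a stand-in for the real block fetch; it exists only for this sketch.
func fetchBlock(height int64) {
	fmt.Println("fetched block", height)
}

func main() {
	// maxConcurrency is the cap we'd want to make configurable; 64 mirrors the
	// current ramp-up target mentioned above.
	maxConcurrency := 64

	sem := make(chan struct{}, maxConcurrency) // counting semaphore
	var wg sync.WaitGroup

	for height := int64(0); height < 1000; height++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxConcurrency fetches are in flight
		go func(h int64) {
			defer wg.Done()
			defer func() { <-sem }()
			fetchBlock(h)
		}(height)
	}
	wg.Wait()
}
```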

Is it using any swap memory? The OS will prevent it from OOM'ing by mapping some virtual memory space to disk, and that can really slow things down if you are using a non-SSD.

This will SIGNIFICANTLY increase performance. The indexer DB is very much geared towards SSDs (in fact it performs quite badly on an HDD).

Yeah, that’s my hunch. I think putting up a PR to allow a configurable max concurrency in the indexer makes sense. In short, we would add another arg here:

This is the option:
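Purely as an illustration, the change would boil down to reading the cap from configuration instead of hard-coding it; the variable name below is hypothetical and not the actual rosetta-bitcoin option:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultSyncConcurrency mirrors the current hard-coded ramp-up target.
const defaultSyncConcurrency = 64

// syncConcurrency returns the configured cap, falling back to the default.
// MAX_SYNC_CONCURRENCY is a hypothetical variable name used only for this sketch.
func syncConcurrency() int64 {
	v := os.Getenv("MAX_SYNC_CONCURRENCY")
	if v == "" {
		return defaultSyncConcurrency
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil || n <= 0 {
		return defaultSyncConcurrency
	}
	return n
}

func main() {
	fmt.Println("using sync concurrency:", syncConcurrency())
}
```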

Hi Patrick

It was indeed a resource issue, as you mentioned. I added more RAM and an SSD, and now the sync is a lot faster. We are also not seeing the RPC timeout issues anymore.
One silly mistake on our side was that our Kubernetes cluster was CPU-limiting the node, so it was only getting ~1-2 CPUs, which also slowed things down a lot.

Also, a configurable max concurrency would be a good idea.

Thanks a lot for the help.

Kashif


:man_facepalming: Can’t say I’ve made similar mistakes haha.

Would love to see you put up an issue for this! We use issues to track/prioritize updates and fixes.