Client occasionally stops importing new blocks #11758
Comments
This sounds concerning, and debugging it seems non-trivial. The best thing would be a backtrace of what all the threads are doing when the node freezes up, which requires attaching a debugger to the running process. Can you post your configuration please? It might help us reproduce.
We are also experiencing similar issues after the upgrade to version 3.0.
@dvdplm, thanks for the suggestions, I will try all that and report back. Attaching a config.toml from the current hardware with 64 GB RAM that is fatdb synced. The previous hardware (which also exhibited the same problem) was fast synced and had 32 GB, so its configuration had smaller cache sizes, but otherwise it was similar.
@dvdplm There is a thread backtrace on https://github.com/openethereum/openethereum/issues/11737 which seems like the same issue as this one.
@dvdplm Could you elaborate on what exactly you would want from gdb? Only the output of `thread apply all bt`? I tried it once but it only spits out addresses, as there are no debug symbols. Is that still helpful to you?
@palkeo, I'll make the gdb trace the next time it happens. Can you please share your cache sizes and RAM size?
My cache sizes are here: https://github.com/openethereum/openethereum/issues/11735
You are right, without the debug symbols in there it's not useful. My bad.
We have also been experiencing these issues since version 2.7.2, on multiple independent nodes.
I can confirm this happens. Without more data points/metrics we are blind when trying to diagnose it. @adria0 is working on adding Prometheus metrics, and that might shed more light here.
Is there anything you can think of that would be useful for troubleshooting it?
Is there anything that would be helpful for debugging while it is stuck? I have a stuck node right now...
I had this problem too: v3.0.1, /opt/openethereum, SSD disk, server room with 1 Gbps internet. A restart didn't help, a reboot didn't help. Before that, RPC got really slow (10 seconds for an account balance query), and I had about 3500 accounts to recover. I started over, deleted the accounts and the whole blockchain, and it failed to sync snapshots.
Very annoying, business down for the second day, switching to geth.
Had the issue again. Attaching the openethereum shutdown log and gdb output.
Thank you @Marko429! I will analyze it more later, but what I can see so far is that threads 46, 47, 48 and 49 are stuck in SyncSupplier::dispatch_packet. These four threads correspond to our four Sync/Network IoWorkers. Threads 24, 25, 26 and 27 are Client IoWorkers. Thread 39 is the last one that is deadlocked; it is called from the Importer thread.
I did re-sync again, with no hardware changes and no software changes. The snapshot got synced; now syncing blocks. I suspect there is a race-condition deadlock problem.
Putting a link to the important gdb log here: https://github.com/openethereum/openethereum/issues/11737#issuecomment-637042123 From this log I can see that Network/Sync IoWorker threads 33, 34, 35 and 36 are stuck, and the Client IoWorkers are:
@rakita One thing that is also in #11737 is full debug logging up to the point of the hang, in case you have not seen it (https://github.com/openethereum/openethereum/issues/11737#issuecomment-649679781). I'm also happy to do any debugging on our node when it hangs again, or to enable any extra logging you would like, to try and track this issue down.
@CrispinFlowerday I did see it, and thank you for both the gdb and the debug log!
Compiling the data:
I can confirm our node is working fine as well with the rakita/11758_use_std_rwlock branch. I have also now seen a hang on a Ropsten node, which hadn't occurred for us previously (using the release code, although right now I'm not sure exactly which release we are running there).
@CrispinFlowerday and @Marko429 thank you both for the help!
Both our OpenEthereum nodes (16 cores, 16 GB memory, RAID-5 SSD pool, archiving + tracing) have been stuck for ~3 weeks. Even restarting doesn't solve the problem anymore. The nodes are utilizing the SSD pool and just do nothing. Please raise the priority of this issue; OpenEthereum is pretty much useless at this point. Edit:
Oh, and unfortunately this did not help in our case. We are now running one node with this. Happy to provide logs if it helps.
Hello @cogmon, if a node does not recover after a restart, that means this is a different problem. On that note, we did notice a lot of regressions introduced by the big changes in 2.7.x, and 3.0.x is mostly 2.7.x with rebranding. For the next release, we decided to move back to the stable 2.5.x line and make it ready for the Berlin fork. #11858
Hi there, thanks for the info. I have started Parity v2.5.13 and it does not find any peers:
Is there any special requirement to run this version?
@g574, same here
Yes @g574, Parity 2.5.13 has a hardcoded list of bootnodes, but Parity Tech is no longer running them. You can switch to the Gnosis-maintained bootnodes by using
this should work.
@adria0 Thanks a lot, it works.
@islishude see #11858.
@rakita 2.5.13 works fairly well, so that looks like a good strategy.
We're running 2.5.13 where possible as well, so yes, looking forward to it 👍
I'm not at all surprised, yet somehow relieved, that my issue is no longer CentOS 7 related. I eventually closed the old issue after adding the --reserved-peers nodes.txt argument to sync to my local Go-Ethereum archive node.
It looks like this behavior? https://docs.rs/parking_lot/0.11.0/parking_lot/type.RwLock.html "This lock uses a task-fair locking policy which avoids both reader and writer starvation. This means that readers trying to acquire the lock will block even if the lock is unlocked when there are writers waiting to acquire the lock. Because of this, attempts to recursively acquire a read lock within a single thread may result in a deadlock."
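For anyone skimming, here is a minimal, self-contained sketch of the failure mode the parking_lot docs describe. This is illustrative Rust only, not OpenEthereum code; the thread roles and timings are made up.

```rust
// Illustration of the recursive-read deadlock under a task-fair RwLock.
// Requires the parking_lot crate (e.g. parking_lot = "0.11").
use parking_lot::RwLock;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u64));

    // Thread A: takes a read lock, then (while still holding it) tries to
    // take a second read lock on the same RwLock.
    let a = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let first = lock.read();
            // Give thread B time to queue up as a writer.
            thread::sleep(Duration::from_millis(100));
            // With a task-fair lock, this second read queues behind the
            // waiting writer, and the writer waits on `first`: deadlock.
            let second = lock.read();
            println!("never reached: {} {}", *first, *second);
        })
    };

    // Thread B: asks for the write lock while thread A holds a read lock.
    let b = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let mut guard = lock.write(); // queues behind thread A's read
            *guard += 1;
        })
    };

    a.join().unwrap();
    b.join().unwrap();
}
```

Under a task-fair policy the second read in thread A blocks behind thread B's pending write, while thread B waits for thread A's first read guard to drop, so neither can make progress.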
@matkt, to be honest, I expected something like that when I first looked at parking_lot, but when I looked at the gdb logs and the code it didn't fit. It was a strange conclusion, and that is why we added logs after every lock, to try to deduce who is the last one holding the lock, and we got an empty list.
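The "logs after every lock" approach could look roughly like the sketch below: a hypothetical TracedRwLock wrapper (not the actual OpenEthereum patch) that records the most recent acquirer, so the last holder can be inspected after a hang.

```rust
// Hypothetical diagnostic wrapper around a RwLock that records who acquired
// it last. A sketch of the approach described above, not the real code.
use std::sync::{Mutex, RwLock, RwLockReadGuard, RwLockWriteGuard};

pub struct TracedRwLock<T> {
    inner: RwLock<T>,
    // Name of the call site that most recently acquired the lock.
    last_holder: Mutex<Option<&'static str>>,
}

impl<T> TracedRwLock<T> {
    pub fn new(value: T) -> Self {
        TracedRwLock {
            inner: RwLock::new(value),
            last_holder: Mutex::new(None),
        }
    }

    pub fn read(&self, who: &'static str) -> RwLockReadGuard<'_, T> {
        let guard = self.inner.read().expect("lock poisoned");
        *self.last_holder.lock().expect("lock poisoned") = Some(who);
        eprintln!("read lock acquired by {}", who);
        guard
    }

    pub fn write(&self, who: &'static str) -> RwLockWriteGuard<'_, T> {
        let guard = self.inner.write().expect("lock poisoned");
        *self.last_holder.lock().expect("lock poisoned") = Some(who);
        eprintln!("write lock acquired by {}", who);
        guard
    }
}
```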
I started running a node with https://github.com/openethereum/openethereum/pull/11769 and https://github.com/openethereum/openethereum/pull/11766 in, and I haven't seen the issue for the past ~30h (whereas it would occur multiple times per day before); it might be worth merging the PRs.
Hello @ngotchac, |
You might be right, I haven't looked too much into the PRs. However, it might still be worth merging them if the code is OK, so that other people can try them out.
The std_rwlock version is still working without fault. It's been almost a month now. I'd wager very strongly now that this issue is fixed by switching to the standard Rust locks.
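For readers following along, the gist of the std_rwlock experiment is presumably a swap like the sketch below. This is illustrative only; Client and ChainState are made-up types, and the real rakita/11758_use_std_rwlock branch may do this differently. The visible difference at call sites is that std::sync::RwLock hands back Result-wrapped guards.

```rust
// Illustrative sketch of using the standard library RwLock at a call site,
// instead of parking_lot. Types and fields here are hypothetical.
use std::sync::RwLock;

struct ChainState {
    best_block: u64,
}

struct Client {
    state: RwLock<ChainState>,
}

impl Client {
    fn best_block(&self) -> u64 {
        // parking_lot: self.state.read().best_block
        // std: the guard is wrapped in a Result and must be unwrapped.
        self.state.read().expect("lock poisoned").best_block
    }

    fn set_best_block(&self, n: u64) {
        self.state.write().expect("lock poisoned").best_block = n;
    }
}
```

One plausible, but unconfirmed, reason this would avoid the hang is that the standard library lock does not guarantee the task-fair policy quoted above, so a recursive read is less likely to queue behind a waiting writer; that remains speculation rather than a confirmed diagnosis.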
@CrispinFlowerday, is std_rwlock still working well for you? We are still running without a single lock-up. I don't think we have ever run this long since 2.7 came out. It seems to me the issue is solved.
Yep, still going strong for us - been running since July 29th with no issues.
@rakita could we please consider merging these PRs and cutting a version? It would be great to reduce the uncertainty about the next version, specifically whether we'll have to do a full re-sync.
Please use 2.5 if this issue affects you. The next release (3.1) will be based on 2.5, which is not affected.
OpenEthereum occasionally (1-3 times per month) stops importing new blocks; it simply goes silent without producing any error. We then issue SIGTERM and get the "Finishing work, please wait..." message, but it does not finish even after many minutes, so we kill the process with SIGKILL. Upon restarting, everything works normally until the next such 'freeze'.
We first noticed this behaviour in Parity 2.7.2 several months ago, and it is still present with OpenEthereum 3.0.0. We noticed it independently on two different hardware configurations, one fast synced, the other fatdb synced from scratch. Similar issues are reported in #11737 and #11539, but the latter is linked to slow block import, which is not the case here. Except for the 'freezes', everything works responsively, fast and overall very well; block imports are very fast.
Any suggestions on how to debug this issue are much appreciated. Thanks.