
Ingester can get stuck with growing block queue #18

Open
scottyeager opened this issue Dec 31, 2024 · 0 comments

There's a rare issue where the ingester keeps queuing new blocks to process but never actually processes them. While this is happening, some worker processes occasionally die and get respawned.

Here's a sample of logs:

2024-12-31 18:58:54.309482 processed 0 blocks in 30 seconds 10777 blocks queued 5 processes alive 0 write jobs
2024-12-31 18:59:24.876481 processed 0 blocks in 30 seconds 10782 blocks queued 5 processes alive 0 write jobs
2024-12-31 18:59:55.710535 processed 0 blocks in 30 seconds 10787 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:00:26.197006 processed 0 blocks in 30 seconds 10792 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:00:56.663733 processed 0 blocks in 30 seconds 10797 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:01:27.195469 processed 0 blocks in 30 seconds 10802 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:01:57.681833 processed 0 blocks in 30 seconds 10807 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:02:42.787301 processed 0 blocks in 30 seconds 10815 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:03:13.323732 processed 0 blocks in 30 seconds 10820 blocks queued 4 processes alive 0 write jobs
More than 5 jobs remaining but fewer processes. Spawning more workers
2024-12-31 19:03:43.789056 processed 0 blocks in 30 seconds 10825 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:04:14.355872 processed 0 blocks in 30 seconds 10830 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:04:44.812938 processed 0 blocks in 30 seconds 10835 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:05:15.311260 processed 0 blocks in 30 seconds 10840 blocks queued 5 processes alive 0 write jobs
2024-12-31 19:05:45.792779 processed 0 blocks in 30 seconds 10845 blocks queued 5 processes alive 0 write jobs

This is easy enough to catch via monitoring and address manually with a restart, but I wonder if some simple logic could achieve the same result automatically. For example, if the number of blocks processed over a certain time period (say, five minutes) is below some threshold, abort and let the process manager restart the service. Checking that we actually have connectivity to tfchain might be a nice touch, but it's not a huge deal to restart every five minutes should network connectivity be lost.
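A minimal sketch of that watchdog idea, assuming the ingester already runs a periodic stats loop like the one producing the logs above. The names here (process_queue_batch, the 30 second interval, the thresholds) are illustrative assumptions, not actual identifiers from the codebase:

import sys
import time

STALL_WINDOW = 5 * 60      # seconds with no progress before declaring a stall
CHECK_INTERVAL = 30        # matches the existing 30 second stats interval
MIN_BLOCKS_PER_WINDOW = 1  # any progress at all resets the timer

def run_with_watchdog(process_queue_batch):
    """Hypothetical wrapper around the existing stats loop.

    process_queue_batch() is assumed to return the number of blocks
    processed since the last call.
    """
    last_progress = time.time()

    while True:
        processed = process_queue_batch()

        if processed >= MIN_BLOCKS_PER_WINDOW:
            last_progress = time.time()
        elif time.time() - last_progress > STALL_WINDOW:
            # No blocks processed for the whole window: exit and let the
            # process manager restart the ingester from a clean state.
            print(f"no blocks processed in {STALL_WINDOW}s, exiting for restart",
                  flush=True)
            sys.exit(1)

        time.sleep(CHECK_INTERVAL)

Exiting nonzero (rather than trying to recover in-process) keeps the logic trivial and relies on whatever supervisor already restarts the ingester today.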
