missed heartbeat from celery@dify-worker-**** #12325

Open
5 tasks done
boHam-wang opened this issue Jan 3, 2025 · 1 comment
Labels
🐞 bug (Something isn't working), good first issue (Good first issue for newcomers)

Comments

@boHam-wang

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.12.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I deployed the Dify images in a Kubernetes cluster and started 5 dify-worker pods, with Flower monitoring the status of the worker nodes.
After all containers started, everything ran well. But when I upload a document to build the index, the node that picks up the task goes offline and does not recover. At the same time, I can find the following logs in my log console:

# podname: dify-worker-6cb6dc8697-bp8q2 
2025-01-03 10:59:24.453 INFO [MainThread] [connection.py:22] - Connected to redis://:**@redis-cluster-master.dify-ai.svc.cluster.local:6379/1
2025-01-03 10:59:24.459 INFO [MainThread] [mingle.py:40] - mingle: searching for neighbors
2025-01-03 10:59:25.824 INFO [MainThread] [mingle.py:43] - mingle: sync with 1 nodes
2025-01-03 10:59:25.826 INFO [MainThread] [mingle.py:47] - mingle: sync complete
2025-01-03 10:59:25.856 INFO [Dummy-3] [pidbox.py:111] - pidbox: Connected to redis://:**@redis-cluster-master.dify-ai.svc.cluster.local:6379/1.
2025-01-03 10:59:25.867 INFO [MainThread] [worker.py:175] - celery@dify-worker-6cb6dc8697-bp8q2 ready.
2025-01-03 10:59:26.947 INFO [Dummy-3] [control.py:342] - Events of group {task} enabled by remote.
2025-01-03 10:59:34.291 INFO [Dummy-3] [control.py:375] - sync with celery@dify-worker-6cb6dc8697-lmjh2
2025-01-03 10:59:38.351 INFO [Dummy-3] [control.py:375] - sync with celery@dify-worker-6cb6dc8697-wr8qj
2025-01-03 10:59:40.604 INFO [Dummy-3] [control.py:375] - sync with celery@dify-worker-6cb6dc8697-6fdnr
2025-01-03 11:11:35.096 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[f87d18b4-ccad-452c-abb8-e36c0d38a6e2] received
2025-01-03 11:11:35.915 INFO [Dummy-5] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:11:35.995 INFO [Dummy-5] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[f87d18b4-ccad-452c-abb8-e36c0d38a6e2] succeeded in 0.8925492763519287s: None
2025-01-03 11:11:40.037 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[4076ad1a-1d1e-48a8-9a12-66b0be8abc59] received
2025-01-03 11:11:40.399 INFO [Dummy-8] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:11:40.473 INFO [Dummy-8] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[4076ad1a-1d1e-48a8-9a12-66b0be8abc59] succeeded in 0.4311293810606003s: None
2025-01-03 11:23:36.277 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[b7819f67-153b-40c7-b752-e1fd99946dc2] received
# podname:dify-worker-6cb6dc8697-6fdnr 
2025-01-03 10:59:40.575 INFO [MainThread] [connection.py:22] - Connected to redis://:**@redis-cluster-master.dify-ai.svc.cluster.local:6379/1
2025-01-03 10:59:40.582 INFO [MainThread] [mingle.py:40] - mingle: searching for neighbors
2025-01-03 10:59:41.613 INFO [MainThread] [mingle.py:43] - mingle: sync with 4 nodes
2025-01-03 10:59:41.614 INFO [MainThread] [mingle.py:47] - mingle: sync complete
2025-01-03 10:59:41.643 INFO [MainThread] [worker.py:175] - celery@dify-worker-6cb6dc8697-6fdnr ready.
2025-01-03 10:59:41.650 INFO [Dummy-3] [pidbox.py:111] - pidbox: Connected to redis://:**@redis-cluster-master.dify-ai.svc.cluster.local:6379/1.
2025-01-03 10:59:41.944 INFO [Dummy-3] [control.py:342] - Events of group {task} enabled by remote.
2025-01-03 11:12:33.123 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[aff71b6d-b733-4492-ae48-f6d4d5ba2a50] received
2025-01-03 11:12:33.732 INFO [Dummy-5] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:12:33.783 INFO [Dummy-5] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[aff71b6d-b733-4492-ae48-f6d4d5ba2a50] succeeded in 0.6574982842430472s: None
2025-01-03 11:14:07.699 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[cfd961eb-2b3b-4d15-94ae-fee61f54c30c] received
2025-01-03 11:14:08.106 INFO [Dummy-8] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:14:08.223 INFO [Dummy-8] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[cfd961eb-2b3b-4d15-94ae-fee61f54c30c] succeeded in 0.5212409980595112s: None
2025-01-03 11:15:04.701 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[fbf14db5-c761-4fb6-a1e1-6aeab3d1106e] received
2025-01-03 11:15:05.049 INFO [Dummy-11] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:15:05.108 INFO [Dummy-11] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[fbf14db5-c761-4fb6-a1e1-6aeab3d1106e] succeeded in 0.40471500158309937s: None
2025-01-03 11:15:42.626 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[d0609a4a-a704-446c-a745-c6795f73e8d5] received
2025-01-03 11:15:42.911 INFO [Dummy-14] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:15:43.651 INFO [Dummy-14] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[d0609a4a-a704-446c-a745-c6795f73e8d5] succeeded in 1.0220515485852957s: None
2025-01-03 11:16:50.840 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[65715d29-f806-449e-82b6-671011214006] received
2025-01-03 11:16:51.130 INFO [Dummy-17] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:16:51.219 INFO [Dummy-17] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[65715d29-f806-449e-82b6-671011214006] succeeded in 0.377222154289484s: None
2025-01-03 11:17:16.004 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[cf306ca0-5869-4de3-9bfa-84e1a1f635a3] received
2025-01-03 11:17:16.338 INFO [Dummy-20] [ops_trace_task.py:48] - Processing trace tasks success, app_id: 9a71b7b7-7c8b-407b-b92b-93d425b16157
2025-01-03 11:17:16.411 INFO [Dummy-20] [trace.py:128] - Task tasks.ops_trace_task.process_trace_tasks[cf306ca0-5869-4de3-9bfa-84e1a1f635a3] succeeded in 0.40508390963077545s: None
2025-01-03 11:23:41.830 INFO [Dummy-22] [gossip.py:145] - missed heartbeat from celery@dify-worker-6cb6dc8697-bp8q2
2025-01-03 11:24:56.838 INFO [Dummy-23] [gossip.py:145] - missed heartbeat from celery@dify-worker-6cb6dc8697-fwnsq
2025-01-03 11:32:51.891 INFO [Dummy-24] [gossip.py:145] - missed heartbeat from celery@dify-worker-6cb6dc8697-wr8qj
2025-01-03 11:34:45.596 INFO [MainThread] [strategy.py:161] - Task tasks.ops_trace_task.process_trace_tasks[c09875e9-ff7d-45d8-bbca-47ee82b97746] received

Screenshot of Flower:

[image: 1735875472787]

All of the workers that received a task went offline; only the workers that had not received a task stayed online.
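To cross-check the Flower view against the logs, a small script can list which workers a pod reported as having missed heartbeats. This is only a sketch: the function name `missed_heartbeats` is hypothetical, and the regular expression assumes the log line format shown in the excerpts above.

```python
import re

# Assumed line format (taken from the log excerpts in this report):
# 2025-01-03 11:23:41.830 INFO [Dummy-22] [gossip.py:145] - missed heartbeat from celery@<worker>
MISSED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"missed heartbeat from (?P<worker>\S+)"
)


def missed_heartbeats(log_text):
    """Return (timestamp, worker) pairs for every 'missed heartbeat' line.

    Hypothetical helper for illustration; it is not part of Celery or Dify.
    """
    found = []
    for line in log_text.splitlines():
        match = MISSED.match(line.strip())
        if match:
            found.append((match.group("ts"), match.group("worker")))
    return found
```

Running it over the log of each pod and comparing the reported worker names with the pods shown as offline in Flower should show whether the two views agree.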

✔️ Expected Behavior

The workers should always stay online so that they can accept tasks to build the index.

❌ Actual Behavior

The workers that receive tasks go offline and do not recover to accept new tasks.
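One way to narrow this down is to look for tasks that a worker logged as received but never logged as succeeded. In the excerpt for pod dify-worker-6cb6dc8697-bp8q2, the last line is a task received at 11:23:36.277 with no matching "succeeded" line, and another pod reports a missed heartbeat from that worker about five seconds later (the excerpt may simply end there, so this is a hint, not proof). The sketch below assumes the log format shown in this report; `unfinished_tasks` is a hypothetical helper name, and it only tracks "succeeded" lines, so a task that failed would also be reported.

```python
import re

# Assumed formats, taken from the log excerpts in this report:
#   ... - Task <name>[<uuid>] received
#   ... - Task <name>[<uuid>] succeeded in <seconds>s: <result>
RECEIVED = re.compile(r"Task (?P<name>[\w.]+)\[(?P<id>[0-9a-f-]+)\] received")
SUCCEEDED = re.compile(r"Task (?P<name>[\w.]+)\[(?P<id>[0-9a-f-]+)\] succeeded")


def unfinished_tasks(log_text):
    """Return {task_id: task_name} for tasks received but never logged as succeeded.

    Hypothetical helper for illustration; it is not part of Celery or Dify.
    """
    pending = {}
    for line in log_text.splitlines():
        received = RECEIVED.search(line)
        if received:
            pending[received.group("id")] = received.group("name")
            continue
        succeeded = SUCCEEDED.search(line)
        if succeeded:
            # The task completed, so it is no longer a candidate for a hang.
            pending.pop(succeeded.group("id"), None)
    return pending
```

Any task left in the result for a pod that later went offline is a candidate for the task that was running when the worker stopped sending heartbeats.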

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 3, 2025
@crazywoola
Member

If you are interested in implementing this issue, please feel free to open a pull request.

@crazywoola crazywoola added the good first issue Good first issue for newcomers label Jan 3, 2025