Replies: 1 comment 1 reply
-
I think it depends on what the "dead" executor really means. There are certain problems (in computer science generally) that simply cannot be handled: running out of memory, some deadlocks that cannot be detected (and some heartbeats may even keep working while the worker is frozen). If you can find out why the workers were actually "dead", adding some custom health checks on the workers/Redis might be a good idea. I do not think we have anything for that in the Chart. PRs are most welcome.
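To illustrate the worker side of that, here is a minimal sketch of a custom worker health check that could back a Kubernetes exec liveness probe. It assumes the Celery app is importable from `airflow.executors.celery_executor` (the Airflow 2.3 location; later releases moved it into the Celery provider package), and it only proves that a worker responds to control messages, so a worker frozen mid-task may still pass, as noted above about heartbeats.

```python
"""Sketch: ping Celery workers through Airflow's Celery app.

Assumes the Airflow 2.3 import path. Exit code 0 means at least one
worker answered; non-zero means none did (or the broker was unreachable).
"""
import sys

from airflow.executors.celery_executor import app  # moved into the Celery provider in newer Airflow


def ping_workers(timeout: float = 10.0) -> dict:
    """Return a mapping of worker name -> ping reply, empty if no worker answered."""
    replies = app.control.inspect(timeout=timeout).ping()
    return replies or {}


if __name__ == "__main__":
    replies = ping_workers()
    for worker, reply in replies.items():
        print(f"{worker}: {reply}")
    sys.exit(0 if replies else 1)
```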
-
Hi, I came across a problem on Airflow 2.3.3 with a Redis broker, all deployed to Kubernetes with the official Helm chart.
In summary, we had multiple tasks marked as queued with no sign of ever being scheduled. Restarting the scheduler or the workers only produced this error:
[2022-08-10 16:28:16,314] {base_executor.py:215} ERROR - could not queue task TaskInstanceKey(dag_id='', task_id='', run_id='scheduled__2022-07-15T07:00:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)
According to the evidence we collected, there was a dead Celery executor holding on to multiple tasks from different DAGs. To fix the problem we had to restart Redis and then restart the scheduler and the workers. After that, the scheduler could finally pick up the stuck tasks:
[2022-08-10 16:32:29,205] {celery_executor.py:532} INFO - Adopted the following 13 tasks from a dead executor
<TaskInstance: *. scheduled__2022-07-15T07:00:00+00:00 [queued]>
<TaskInstance: *. scheduled__2022-07-15T07:00:00+00:00 [queued]>
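One way to catch this state before it lingers would be a small check against the metadata database for task instances that have been sitting in QUEUED longer than some threshold; a minimal sketch follows. The 30-minute cutoff is an arbitrary example value, not an Airflow default, and the query assumes the Airflow 2.x TaskInstance.queued_dttm column.

```python
"""Sketch: report task instances stuck in QUEUED (Airflow 2.x metadata DB)."""
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def stuck_queued_tasks(max_queued_age=timedelta(minutes=30), session=None):
    """Return task instances that have been queued for longer than max_queued_age."""
    cutoff = timezone.utcnow() - max_queued_age
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED, TaskInstance.queued_dttm < cutoff)
        .all()
    )


if __name__ == "__main__":
    for ti in stuck_queued_tasks():
        print(f"stuck in queued: {ti.dag_id}.{ti.task_id} run_id={ti.run_id}")
```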
Any tips or settings to avoid this issue? I would think that a custom health check on Redis could prevent it.
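For the Redis side, a minimal sketch of such a custom check is below. It assumes the broker URL is exposed via the AIRFLOW__CELERY__BROKER_URL environment variable and that tasks go to a Celery queue named "default" (Airflow's default queue name); both are assumptions to adjust for your deployment.

```python
"""Sketch: basic Redis broker health check for the Celery executor.

Assumes AIRFLOW__CELERY__BROKER_URL points at the Redis broker and that
the Celery queue is named "default".
"""
import os
import sys

import redis


def check_broker(queue_name: str = "default") -> bool:
    """Ping the broker and report the pending-message depth of the queue."""
    broker_url = os.environ["AIRFLOW__CELERY__BROKER_URL"]
    try:
        client = redis.Redis.from_url(broker_url)
        client.ping()  # raises if the broker is unreachable or unresponsive
    except redis.RedisError as exc:
        print(f"broker unhealthy: {exc}")
        return False
    depth = client.llen(queue_name)
    print(f"broker healthy, {depth} message(s) pending on '{queue_name}'")
    return True


if __name__ == "__main__":
    sys.exit(0 if check_broker() else 1)
```

Comparing that pending-message count against the number of task instances Airflow reports as queued can hint at the situation described above, where the scheduler believes tasks are queued but nothing is actually moving on the broker.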