-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
There is not enough information to diagnose or do something about it. I guess the problem might be somewhere at the resource level - my guess would be memory, with celery tasks swapping out to disk or similar. I think more checking at the EKS level is needed, and possibly peeking into the running instance to see what process is taking up the resources and what it is doing. I do not recall any similar problem in Airflow, so this is very likely some problem with the deployment.
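A minimal sketch of that kind of check, assuming the workers run as pods labelled `component=worker` in an `airflow` namespace (both names are placeholders for whatever your chart uses):

```shell
# Per-pod resource usage (requires metrics-server on the cluster)
kubectl -n airflow top pod -l component=worker

# Peek inside one worker to see which processes are eating CPU/memory
kubectl -n airflow exec -it airflow-worker-0 -- ps aux --sort=-%mem | head -n 15

# Rough check for memory pressure inside the container
kubectl -n airflow exec -it airflow-worker-0 -- cat /proc/meminfo | head -n 5
```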
-
I'm experiencing this exact same problem. Did you ever end up solving it? Edit: are you perchance also using
-
I'm experiencing a similar problem. Any workaround?
-
@ankitmon @mikkelam As explained above, we do not have enough evidence to reproduce and diagnose the issue. I suggest you gather more evidence and create a proper bug report with enough information to help diagnose and find the root cause. This might or might not be the same problem, and any workarounds might or might not work for you - but if you can do more investigation and gather more evidence (abnormal logs, the circumstances, and a way your problem can be easily reproduced), I heartily invite you to open an issue where you collect that evidence. That is the best way you can help us help you with the problem you have.
-
It seems to me that this is related to task stalling in Celery, so if you are hitting this issue, experiment with setting the environment variable below:
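One setting that targets stalled Celery tasks in Airflow 2.3+ is `[celery] stalled_task_timeout`; the sketch below assumes that is the option in question, and the value is only an example:

```shell
# Hedged example: re-queue Celery tasks that are reported as started but
# never make progress. 0 (the default) disables the check.
export AIRFLOW__CELERY__STALLED_TASK_TIMEOUT=300
```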
-
As a workaround, we increased the machines from c5.xlarge to c5.2xlarge. We still have the same number of workers (6), but they got more memory.
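On EKS, that kind of resize might look roughly like the following, assuming eksctl-managed nodegroups (cluster and nodegroup names are placeholders):

```shell
# Add a nodegroup with larger instances for the Airflow workers
eksctl create nodegroup --cluster my-airflow-cluster \
  --name airflow-workers-2xlarge --node-type c5.2xlarge --nodes 6

# Once the new nodes are Ready, drain and remove the old nodegroup
eksctl delete nodegroup --cluster my-airflow-cluster \
  --name airflow-workers-xlarge --drain
```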
-
Apache Airflow version
2.3.3 (latest released)
What happened
To be honest I don't know why it stopped working properly.
In our process we have 2 DAGs per client; the first DAG has 3 tasks, the second one 5-8 tasks. In general the first DAG should take ~3 min and the second ~5-10 min to finish. A week ago we added 2 new clients with a similar amount of data to the previous customers, and Airflow started to behave strangely.
DAGs (different ones, not only for those 2 customers) are stuck in the `running` state for hours, even though all tasks inside finished a few minutes after the start; the worker is doing "something" that does not show up in the logs and causes a high load (~12, when in normal conditions we have < 1). Or a DAG is in the `running` state and a task stays in `queued` (or `no_status`) for hours. We've mitigated the issue by restarting workers and schedulers every hour, but that is not a long-term or mid-term solution.
We're using the CeleryExecutor (in Kubernetes - 1 pod = 1 worker). It does not help if we change concurrency from 4 to 1, for example. On the worker pod the process list shows only celery, gunicorn and the current task.
We had `apache/airflow:2.2.5-python3.8`, but right now it's `apache/airflow:2.3.3-python3.8`, with the same problems.
What you think should happen instead
No response
How to reproduce
No response
Operating System
Debian GNU/Linux 11 (bullseye) on pods, Amazon Linux on EKS nodes
Versions of Apache Airflow Providers
```shell
$ celery --version
5.2.7 (dawn-chorus)
$ pip freeze | grep -i flower
flower==1.1.0
```
Deployment
Other 3rd-party Helm chart
Deployment details
Airflow scheduler, webserver, workers and Redis are on our EKS cluster, deployed via our own Helm charts.
We also have RDS (PostgreSQL).
Anything else
Are you willing to submit PR?
Code of Conduct