-
Thanks for opening your first issue here! Be sure to follow the issue template!
-
There is not enough information to diagnose or do something about it. I guess the problem might be somewhere at the resource level - my guess would be memory, with celery tasks swapping out to disk or similar. I think more checking at the EKS level is needed, and possibly peeking into the running instance to see what process is taking up the resources and what it is doing. I do not recall any similar problem in Airflow, so this is very likely some problem with the deployment.
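A minimal sketch of that kind of check, assuming the workers run as pods labelled `component=worker` in an `airflow` namespace (both names are placeholders for whatever your chart uses):

```shell
# Per-pod resource usage (requires metrics-server on the cluster)
kubectl -n airflow top pod -l component=worker

# Peek inside one worker to see which processes are eating CPU/memory
kubectl -n airflow exec -it airflow-worker-0 -- ps aux --sort=-%mem | head -n 15

# Rough check for memory pressure inside the container
kubectl -n airflow exec -it airflow-worker-0 -- cat /proc/meminfo | head -n 5
```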
-
I'm experiencing this exact same problem. Did you ever end up solving it? Edit: are you perchance also using
-
I'm experiencing a similar problem. Any workaround?
-
@ankitmon @mikkelam As explained above, we do not have enough evidence to reproduce and diagnose the issue. I suggest you gather more evidence and create a proper bug report with enough information to help diagnose and find the root cause. This might or might not be the same problem, and any workarounds might or might not work for you - but if you can do more investigation and gather more evidence (abnormal logs, the circumstances, and a way your problem can be easily reproduced), I heartily invite you to open an issue where you collect that evidence. That is the best way you can help us help you with the problem you have.
-
It seems to me that this is related to task stalling in Celery, so if you are hitting this issue, experiment with setting the environment variable below:
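One setting that targets stalled Celery tasks in Airflow 2.3+ is `[celery] stalled_task_timeout`; the sketch below assumes that is the option in question, and the value is only an example:

```shell
# Hedged example: re-queue Celery tasks that are reported as started but
# never make progress. 0 (the default) disables the check.
export AIRFLOW__CELERY__STALLED_TASK_TIMEOUT=300
```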
-
As a workaround, we increased the machines from c5.xlarge to c5.2xlarge. We still have the same number of workers (6), but they got more memory.
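On EKS, that kind of resize might look roughly like the following, assuming eksctl-managed nodegroups (cluster and nodegroup names are placeholders):

```shell
# Add a nodegroup with larger instances for the Airflow workers
eksctl create nodegroup --cluster my-airflow-cluster \
  --name airflow-workers-2xlarge --node-type c5.2xlarge --nodes 6

# Once the new nodes are Ready, drain and remove the old nodegroup
eksctl delete nodegroup --cluster my-airflow-cluster \
  --name airflow-workers-xlarge --drain
```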
-
Apache Airflow version
2.3.3 (latest released)
What happened
To be honest I don't know why it stopped working properly.
In our process we have 2 DAGs per client; the first DAG has 3 tasks, the second one 5-8 tasks. In general the first DAG should take ~3 min and the second ~5-10 min to finish. A week ago we added 2 new clients with a similar amount of data to the previous customers, and Airflow started to behave strangely.
DAGs (different ones, not only for those 2 customers) are stuck in the `running` state for hours, even though all tasks inside finished a few minutes after the start; the worker is doing "something" that does not show up in the logs and causes a high load (~12, when in normal conditions we have < 1). Or a DAG is in the `running` state and a task stays in `queued` (or `no_status`) for hours. We've mitigated the issue by restarting workers and schedulers every hour, but that is not a long-term or mid-term solution.
We're using the CeleryExecutor (in Kubernetes - 1 pod = 1 worker). It does not help if we change concurrency from 4 to 1, for example. On the worker pod the process list shows only celery, gunicorn and the current task.
We had `apache/airflow:2.2.5-python3.8`, but right now it's `apache/airflow:2.3.3-python3.8`, with the same problems.
What you think should happen instead
No response
How to reproduce
No response
Operating System
Debian GNU/Linux 11 (bullseye) on pods, Amazon Linux on EKS nodes
Versions of Apache Airflow Providers
```shell
$ celery --version
5.2.7 (dawn-chorus)
$ pip freeze | grep -i flower
flower==1.1.0
```
Deployment
Other 3rd-party Helm chart
Deployment details
Airflow scheduler, webserver, workers and Redis are on our EKS cluster, deployed via our own Helm charts.
We also have RDS (PostgreSQL).
Anything else
Are you willing to submit PR?
Code of Conduct