Replies: 52 comments
-
It looks like you are actually using SequentialExecutor. That would perfectly explain the behaviour. Are you sure you are using the LocalExecutor and that the scheduler is running? Can you run
-
In this screenshot the scheduler is running 4 copies of the same process / task. As stated above, the issue is that airflow will not run other dags and the scheduler is not responding. (Strangely, the scheduler is apparently quite happy to run 1 task from 1 dag in 4 parallel processes.) I suspect some value in the configuration, or not enough database connections.
-
It does not look like an Airflow problem, to be honest. It certainly looks like your long running task blocks some resource that blocks the scheduler somehow (but there is no indication how it can be blocked). There must be something in your DAGs or task that simply causes the Airflow scheduler to lock up. This is almost certainly something specific to your deployment (others do not experience it). But what it is, it's hard to say from the information provided. My best guess is that you have some lock on the database and it makes the scheduler wait while running some query. Is it possible to dump the state of the scheduler and, more generally, the state of your machine, resources, sockets, and DB locks while it happens (this should be possible with py-spy, for example)? Also getting all logs of the scheduler and seeing if it actually "does" something might help. Unfortunately the information we have now is not enough to deduce the reason. Any insight into WHERE the scheduler is locked might help with investigating it.
-
BTW. Yeah, checking the limit on connections opened in your DB might be a good idea. Especially if you are using Variables in your DAGs at top level, it MIGHT lead to a significant number of open connections, which might eventually cause the scheduler to try to open a new connection and patiently wait until the DB server has a connection free. It might simply be that your long running tasks are written in a way that they (accidentally) open a lot of those connections and do not close them until the task completes. I think PgBouncer might help with that, but if too many connections are opened by a single long running task and they are not closed, that might also not help too much.
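For illustration, a minimal sketch (the variable key, DAG id, and schedule below are made up, not taken from your code) of the difference between reading a Variable at parse time and reading it inside a task:

```python
# Illustrative sketch only: "my_var" and the DAG/task ids are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Anti-pattern: this line would run on every parse of the DAG file (every few
# seconds), opening a metadata-DB connection each time:
# config_value = Variable.get("my_var")


def use_variable():
    # Reading the Variable here means the connection is only opened
    # when the task actually executes on a worker.
    config_value = Variable.get("my_var")
    print(config_value)


with DAG(
    dag_id="variable_usage_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="use_variable", python_callable=use_variable)
```

With the top-level read, every parse of the file opens a metadata-DB session; moving it inside the callable defers that cost to task execution.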
-
I think you are misusing Airflow. Airflow by definition can run multiple tasks even with a single LocalExecutor, so you are likely misunderstanding how airflow operators are run and are running your code as top-level code, not as a task. Can you please copy your DAG here? You are not supposed to run long running operations when the DAG file is parsed - parsing should complete rather quickly, and from what it looks like, you execute a long running process while parsing happens rather than when tasks are executed: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code
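To illustrate the difference (the module name and DAG/task ids below are placeholders, not your actual code):

```python
# Illustrative sketch only: "my_long_running_job" and the DAG/task ids are placeholders.
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Anti-pattern: top-level code like this executes while the file is being parsed,
# so the DAG processor / scheduler blocks on it instead of a worker:
# subprocess.run(["python", "-m", "my_long_running_job"], check=True)

with DAG(
    dag_id="long_job_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Correct: the long-running command only starts when the task instance runs.
    BashOperator(
        task_id="download_history",
        bash_command="python -m my_long_running_job",
    )
```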
-
I'm not sure how airflow is intended to be used, but sometimes people find other use cases for a tool they didn't design. We run a task that can take a few hours to collect all the historical data and process it, and then we want the task to run once per day. It appears, from my side, that the airflow server UI can't contact the scheduler while the long task is running, and other DAGs can't be run. Perhaps the scheduler wants my code to yield control back to it frequently (once per day of data, for example), but I prefer to let my own code manage the date ranges, because that's where the unit tests are, and all the heavy lifting is in rust anyway.
-
What I do find use for in airflow
-
This is what airflow is designed for. I think you are just using it wrongly (or have misconfigured it). It is supposed to handle that case perfectly (and it works this way for thousands of users), so it's your configuration/setup/way of using it that is wrong.
No. This is not the case (unless you use the SequentialExecutor, which is only supposed to be used for debugging). Airflow is designed to run multiple parallel tasks at a time. You likely have some problem in your airflow installation/configuration. Questions:
I just ran it in 2.2.3 and I was able to successfully start even 5 parallel runs with no problems with the Scheduler.
-
I'm sorry, I didn't intend this to turn into an unpaid debugging session, and it looks like a cosmetic problem more than anything. So I'm comfortable closing this thread if you prefer.
-
I think the max_tis_per_query is quite high, but even with it, it is suspicious to see that 512 tis are processed in 594 seconds. Some of the queries must simply run for a very long time. Is it possible to get some stats from Postgres on what the longest running queries are, @deepchand? There are a number of guides on the internet - for example this one that shows how to do it: https://www.shanelynn.ie/postgresql-find-slow-long-running-and-blocked-queries/
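As a rough sketch, assuming direct access to the Postgres metadata database (the connection parameters below are placeholders), something like this surfaces queries that have been running for more than a few seconds:

```python
# Sketch only: connection parameters are placeholders for your metadata database.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="airflow", user="airflow", password="changeme")
with conn, conn.cursor() as cur:
    # pg_stat_activity lists current backends; keep only queries active for > 5 seconds.
    cur.execute(
        """
        SELECT pid, now() - query_start AS duration, state, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '5 seconds'
        ORDER BY duration DESC;
        """
    )
    for pid, duration, state, query in cur.fetchall():
        print(pid, duration, state, query[:120])
```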
-
I wonder if this is related to the locking issue we recently resolved.
-
Do you remember which one, @uranusjr?
-
This one? #25532
-
I have tried to debug it more and found its
-
Yes, that one.
-
@potiuk I have debugged this issue more and found that updating the task state in the scheduler loop significantly increases the total time of the self.executor.heartbeat() function (airflow/airflow/executors/celery_executor.py, Line 312 in 9ac7428), which is causing the "The scheduler does not appear to be running. Last heartbeat was received X minutes ago" problem. Please have a look and let me know if any other info is needed.
-
I do not know that well, but to me it looks like you have some bottleneck with your Celery queue. You have not specified what kind of queue you used - but I think you should look at your Redis or RabbitMQ and see if there are any problems there. Also it might simply be that your Redis or RabbitMQ is badly configured or overloaded and it is somehow blocking the state update. Can you please set the log level to debug and see if there is more information printed on what's going on in the update_state method? (You will find how to do it in the airflow configuration docs.)
-
@potiuk We are using Redis as the queue. I have cross-verified that Redis is not the bottleneck and messages are consumed from Redis as soon as they land in the queue. We have enough resources available on the Redis side as well.
-
Any chance for debug logs?
-
Debug logs of the scheduler?
-
Yep:
-
@potiuk I have added some more debug logs to find the total time taken while fetching the state of a single result and for all results in a single loop; please find them below.
-
There were too many debug logs, so I tried to filter out only the important ones; let me know if more info is needed.
-
I think your problem is simply an extremely slow connection to your database. 5 seconds to run a single query indicates a HUGE problem with your database. It should take single milliseconds. THIS IS your problem, not airflow. You should fix your DB/connectivity and debug why your database is 1000x slower than it should be.
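One quick sanity check, sketched here with SQLAlchemy (the connection URI is a placeholder for your configured sql_alchemy_conn), is to time a trivial query from the scheduler host:

```python
# Sketch only: replace the URI with the sql_alchemy_conn value from your airflow.cfg.
import time

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:changeme@db-host/airflow")

with engine.connect() as conn:
    start = time.monotonic()
    conn.execute(text("SELECT 1"))
    elapsed_ms = (time.monotonic() - start) * 1000
    # A healthy connection to a nearby Postgres should come back in a few milliseconds.
    print(f"SELECT 1 round trip: {elapsed_ms:.1f} ms")
```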
-
Apache Airflow version
2.1.4
Operating System
Linux / Ubuntu Server
Versions of Apache Airflow Providers
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-postgres==2.3.0
Deployment
Virtualenv installation
Deployment details
Airflow v2.1.4
Postgres 14
LocalExecutor
Installed with Virtualenv / ansible - https://github.com/idealista/airflow-role
What happened
I run a single BashOperator for a long running task: we have to download data for 8+ hours initially from the rate-limited data source API, then download more each day in small increments.
We're only using 3% CPU and 2 GB of memory (out of 64 GB) but the scheduler is unable to run any other simple task at the same time.
Currently only the long task is running and everything else is queued, even though we have more resources:
What you expected to happen
I expect my long running BashOperator task to run, but for airflow to have the resources to run other tasks without getting blocked like this.
How to reproduce
I run a command with BashOperator (I use it because I have Python, C, and Rust programs being scheduled by airflow).
bash_command='umask 002 && cd /opt/my_code/ && /opt/my_code/venv/bin/python -m path.to.my.python.namespace'
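A minimal reconstruction of the DAG around this command (the DAG id, schedule, and task id below are illustrative; only the bash_command is taken from above):

```python
# Reconstruction for illustration: the DAG id, schedule, and task id are not from
# the actual deployment; only the bash_command is quoted from the report.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="long_download",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="download_and_process",
        bash_command="umask 002 && cd /opt/my_code/ && /opt/my_code/venv/bin/python -m path.to.my.python.namespace",
    )
```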
Configuration:
Anything else
This occurs every time consistently, also on 2.1.2
The other tasks have this state:
When the long-running task finishes, the other tasks resume normally. But I expect to be able to do some parallel execution with LocalExecutor.
I haven't tried using pgbouncer.
Are you willing to submit PR?
Code of Conduct