-
I am in the middle of a migration from Airflow running in a virtual machine to a Kubernetes cluster, for now in a staging environment. After a lot of configuration adjustments in the Helm values.yaml, the cluster seems to be stable and working fine. But for some reason, the UI sometimes shows fewer DAGs than are actually available. For example, we have a total of 93 DAGs. After the initial load, which takes a couple of minutes, the count stays stable for some time. Then it drops to a smaller number (like 64) and, after a couple of minutes, it starts to climb back up, eventually returning to 93. We confirmed this is not any kind of browser cache. There were no pod restarts in the meantime, no changes to the cluster, and no DAGs were changed either. We are using git-sync with non-persistent storage, as recommended in the docs. We enabled its debug logs and it seems to be working fine, only downloading changes when the DAGs branch changes, and those changes seem to propagate quickly to all relevant pods. The scheduler logs showed no errors that could explain the drop in total DAGs, except the following line:
I am researching whether this is relevant to the issue at hand, but without success so far. Another fix we tried was enabling the non-default standalone DAG processor, but the behavior is the same. I tried enabling the processor's verbose mode via the env parameter, without success. Its logs are mostly blank, so I have no clue whether the DAG processor is the culprit. We also replaced the CeleryExecutor with the KubernetesExecutor, because it is better suited to our purposes. We did not think it was related to the issue and, as expected, the behavior persists. Since I am from the cloud-infra team and have no previous experience with Airflow, can someone help me understand what the issue could be and what the next steps in diagnosing our environment might be? We are using Airflow 2.9.3 (since it is the most recent version referenced by the latest Helm chart available) with Python 3.12, in a custom Dockerfile. We are not extending the image, we are really customizing it, since we need to perform a couple of compilations and it was more efficient to do this before the Airflow pip installs, to make rebuilds faster and the final image smaller. I did not know whether it was safe to point the image to the latest Airflow available (since I assume an updated Helm chart would have been published if it were), so we kept using this one. Embedding the DAGs in the image is not an option, since they change constantly and rebuilding and redeploying the cluster several times a day is not ideal for us. If updating the cluster to 2.10.3 is safe, or if that version addresses any known issues related to this behavior, please point me in the right direction. Thanks for any tips!
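For context, the relevant part of the configuration looks roughly like the sketch below. This is an illustrative example only, not our exact file - the registry, image and repository names are placeholders, and the key names follow the official apache-airflow Helm chart (double-check them against your chart version):

```yaml
# values.yaml (illustrative sketch, placeholders only)
executor: "KubernetesExecutor"

images:
  airflow:
    repository: registry.example.com/airflow-custom   # custom-built image, not an extended one
    tag: "2.9.3-py3.12"

dags:
  persistence:
    enabled: false            # non-persistent DAG storage, as recommended with git-sync
  gitSync:
    enabled: true
    repo: https://git.example.com/data/dags.git       # placeholder repository
    branch: main
    subPath: dags

dagProcessor:
  enabled: true               # the standalone (non-default) DAG processor we tried
```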
-
That line is important. It tells you that the DAG processor did not parse some of your DAGs within the https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration time (10 minutes) - the timestamps in your log file confirm that: simply, 10 minutes passed since parsing of the DAG files last produced the 8 DAG ids that were deactivated.

How you can prepare for investigation

Now, the question is why, and we need you to investigate several things to find out. First of all - find out which DAGs disappeared (got deactivated).
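A minimal way to do that, assuming direct access to the metadata database and the standard Airflow 2.x `dag` table layout (the `is_active`, `is_paused`, `fileloc` and `last_parsed_time` columns exist in 2.9 - verify against your own schema; the syntax below is Postgres):

```sql
-- DAG ids the scheduler has deactivated because parsing stopped producing them:
SELECT dag_id, fileloc, last_parsed_time, is_paused
FROM dag
WHERE is_active = false
ORDER BY last_parsed_time;

-- For comparison, the currently active DAGs and when each was last parsed:
SELECT dag_id, fileloc, last_parsed_time
FROM dag
WHERE is_active = true
ORDER BY last_parsed_time;
```

Running the first query periodically while you watch the UI should tell you which dag ids flip to inactive and, via `fileloc`, whether they all come from a single file or a small group of files - that narrows down which of the causes below applies.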
Things that could be wrong

a) Your dynamic DAG generation is buggy / unstable and produces a different set of DAG ids every time it is parsed. Generally speaking, when you do dynamic DAG generation, every parse of a DAG file should consistently produce the same dag ids. Again - it is entirely up to you how the DAGs are written. There are different techniques you can use for dynamic DAG generation https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#dynamic-dags-with-external-configuration-from-a-structured-data-file - and it might be that the way you do it is simply unstable. For example, your code pulls information about the DAGs to generate from a JSON file whose content changes in unpredictable / non-stable ways. Or there is a bug in the DAG generation code that sometimes raises an exception, so not all DAGs get generated. This is the most probable cause (a minimal sketch of a stable pattern follows right below).
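For illustration only - this is not your code, just a minimal sketch of the "external configuration from a structured data file" technique linked above. The key property is that the set of dag ids is fully determined by a file that lives next to the DAG file (and is therefore git-synced together with it), so every parse yields the same ids:

```python
# dags/generate_dags.py - hypothetical example of stable dynamic DAG generation.
# No network calls, no randomness, no time-dependent ids at parse time.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.empty import EmptyOperator

CONFIG_FILE = Path(__file__).parent / "dag_configs.json"  # assumed layout

for config in json.loads(CONFIG_FILE.read_text()):
    dag_id = f"generated_{config['name']}"  # deterministic id per config entry
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),
        schedule=config.get("schedule"),
        catchup=False,
    ) as dag:
        EmptyOperator(task_id="placeholder")
    # Expose the DAG object at module level so the processor registers it.
    globals()[dag_id] = dag
```

If, instead, the id set depends on an API response, on data fetched over the network at parse time, or on anything that can intermittently fail or change, consecutive parses can legitimately produce different dag ids - and the "missing" ones get deactivated.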
b) Some of your files are not parsed for some reason. Generally, the Airflow DAG file processor parses all files continuously and should go through all of them in a loop, but there are certain parameters that control this - how many parsing processes run, how often files are re-parsed, and how long a single parse may take (see the config sketch after this list of causes). There are a few other parameters as well, but those are the important ones.

c) Your DAGs do not follow best practices. So one of the other options is that parsing some of your DAG files simply takes a long time - long enough that the DAG file processor needs more than 10 minutes to go through its loop and parse all your files. If you follow best practices https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#reducing-dag-complexity - and make sure that you do not block or spend a lot of time in the top-level code of your DAG files https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code - parsing even hundreds of files should take seconds at most. But there are cases - especially when your top-level code reaches out to external sources and possibly "hangs" or takes a long time to complete - where parsing can take arbitrarily long: minutes or even hours. If some of your DAG files do this, they might simply hang for a long time - with 2 parallel DAG parsing processes, it is enough to have two files whose parsing takes around 10 minutes for some of the remaining files not to be parsed "on time". Similarly, if you have many DAG files that each take a few minutes to parse, that might delay the queue enough that some of your DAGs will not be parsed within the default 10 minutes. You need to optimize your parsing time and follow best practices - ideally, all your DAGs should be parsed in seconds rather than minutes. Of course you can also play with the parameters above and make the timeouts bigger, but that masks the "long parsing" problem rather than solving it, and will result in, for example, far longer delays before DAG file changes are reflected in the parsed DAGs in the DB. You can control the DAG parsing timeout https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-to-control-dag-file-parsing-timeout-for-different-dag-files and the next section https://airflow.apache.org/docs/apache-airflow/stable/faq.html#when-there-are-a-lot-1000-of-dag-files-how-to-speed-up-parsing-of-new-files explains some of the ways you can attempt to speed up parsing. This is the second most probable cause of what you see (a before/after Python sketch follows this list as well).

d) Your syncing process might have some on/off states where files appear / disappear after check-out. Git-sync works by checking out the latest commit and then swapping it with the previous check-out via a symbolic link. Not very likely, though.

e) It might be that some combination of the parameters above and your synchronization settings causes bad sorting of the parsed files, so the processors end up not parsing some of your files - there might be various reasons for that. This is also quite an unlikely cause, but if you use non-POSIX-compliant shared filesystems, I can imagine it happening.

f) Finally - time on your various machines might not be synchronized. If one of the machines (DB, scheduler, processor, git) has a significant time drift, that might make various time calculations go wrong. Not very likely - most computing resources out there use NTP or a similar way of syncing time - but we have seen it happen.

Good luck with your investigation and please come back here with the results.
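Two sketches to make b) and c) concrete. First, the parsing-related settings typically worth reviewing - the option names are real Airflow 2.x settings, the values are the stock defaults to the best of my knowledge, so verify both against the configuration reference linked above:

```ini
# airflow.cfg - illustrative sketch of the parsing-related knobs
[core]
# Hard timeout for importing a single DAG file.
dagbag_import_timeout = 30.0
# Timeout for a full DagFileProcessor run on one file.
dag_file_processor_timeout = 50

[scheduler]
# How many DAG files are parsed in parallel.
parsing_processes = 2
# Minimum interval before the same file is re-parsed.
min_file_process_interval = 30
# How often the DAGs folder is re-listed for added/removed files.
dag_dir_list_interval = 300
# DAG ids not produced by parsing for this long get deactivated
# (the 10-minute window behind the log line you quoted).
dag_stale_not_seen_duration = 600
```

Second, a hypothetical before/after for c): the expensive work moves from parse time into the task callable, so parsing stays fast no matter how slow the external systems are:

```python
# BAD (hypothetical) - executed on every parse, so a slow secrets-manager or
# database call stalls the DAG file processor for this whole file:
#
#   secrets = fetch_secrets_from_vault()       # network call at parse time!
#   config_rows = query_the_warehouse()        # DB connection at parse time!

# BETTER - only cheap imports and definitions at module level; anything slow
# happens inside the task when it actually executes.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def do_work(**context):
    # This is where the Vault lookup / database connection belongs - at
    # execution time, inside the task, not in top-level DAG file code.
    print("fetch secrets and talk to the database here")


with DAG(
    dag_id="example_fast_parse",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="do_work", python_callable=do_work)
```

The same applies to module-level imports of heavy libraries - move them inside the callable if they are only needed at execution time.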
-
As explained in #39332 @vghar-bh - please follow the guide above and report the results of the investigation I outlined. That might help us improve the messaging and explanation of what can go wrong in a number of deployments - or maybe we will eventually find some problem in Airflow that causes it. I'd really appreciate a diligent reading and understanding of how things work and how they map to your deployment, because this is the only way we can get to the bottom of the disappearing-DAGs issue you have - it might help us build more resilience and better messaging, and possibly even find some bugs in Airflow, which would help other users with similar issues. Your help and diligence here are greatly appreciated.
-
BTW. I am planning to turn the explanation above into a "What to do if you see disappearing DAGs" FAQ on our website, so any feedback on that description is also most welcome.
-
Found the issue in our case. It was option "c". After some discussion with the data team, which is the one that creates the DAGs, it became clear that the best practices were not being followed. Besides large imports (which did not need to happen during the parse stage), there were some costly operations happening at parse time, like connecting to a vault to get secure parameters and also opening database connections. Multiplied by the number of affected DAGs, this clearly explained the observed behavior. Thanks again for the detailed post - it made it a lot easier to help them understand this was not an infrastructure issue.