-
I am in the middle of a migration from Airflow running in a virtual machine to a Kubernetes cluster, for now in a staging environment. After a lot of configuration adjustments in the Helm values.yaml, the cluster seems to be stable and working fine. But for some reason, the UI sometimes shows fewer DAGs than are actually available. For example, we have a total of 93 DAGs. After the initial load, which takes a couple of minutes, the count stays stable for some time. Then it drops to a smaller number (like 64) and, after a couple of minutes, it starts to climb back up, eventually returning to 93. We confirmed this is not any kind of browser cache. There were no pod restarts in the meantime, no changes to the cluster, and no DAGs were changed either. We are using git-sync with non-persistent storage, as recommended in the docs. We enabled its debug logs and it seems to be working fine, only downloading changes when the DAGs branch changes, and those changes seem to propagate quickly to all relevant pods. The scheduler logs showed no errors that could explain the drop in total DAGs, except the following line:
I am researching whether this is relevant to the issue at hand, but without success so far. Another fix we tried was enabling the non-default standalone DAG processor, but the behavior is the same. I tried enabling the processor's verbose mode via the env parameter, without success. Its logs are mostly blank, so I have no clue whether the DAG processor is the culprit. We also replaced the CeleryExecutor with the KubernetesExecutor, because it is better suited to our purposes. We did not think it was related to the issue and, as expected, the behavior persists. Since I am from the cloud-infra team and have no previous experience with Airflow, can someone help me understand what the issue could be and what the next steps in diagnosing our environment might be? We are using Airflow 2.9.3 (since it is the most recent version referenced by the latest Helm chart available) with Python 3.12, in a custom Dockerfile. We are not extending the image, we are really customizing it, since we need to perform a couple of compilations and it was more efficient to do this before the Airflow pip installs, to make rebuilds faster and the final image smaller. I did not know whether it was safe to point the image to the latest Airflow available (since I assume an updated Helm chart would have been published if it were), so we kept using this one. Embedding the DAGs in the image is not an option, since they change constantly and rebuilding and redeploying the cluster several times a day is not ideal for us. If updating the cluster to 2.10.3 is safe, or if that version addresses any known issues related to this behavior, please point me in the right direction. Thanks for any tips!
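For context, the relevant part of the configuration looks roughly like the sketch below. This is an illustrative example only, not our exact file - the registry, image and repository names are placeholders, and the key names follow the official apache-airflow Helm chart (double-check them against your chart version):

```yaml
# values.yaml (illustrative sketch, placeholders only)
executor: "KubernetesExecutor"

images:
  airflow:
    repository: registry.example.com/airflow-custom   # custom-built image, not an extended one
    tag: "2.9.3-py3.12"

dags:
  persistence:
    enabled: false            # non-persistent DAG storage, as recommended with git-sync
  gitSync:
    enabled: true
    repo: https://git.example.com/data/dags.git       # placeholder repository
    branch: main
    subPath: dags

dagProcessor:
  enabled: true               # the standalone (non-default) DAG processor we tried
```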
-
That line is important. It tells you that the DAG processor did not parse some of your DAGs within the https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-stale-not-seen-duration time (10 minutes) - the timestamps in your log file confirm that: simply, 10 minutes passed since parsing of the DAG files last produced the 8 DAG ids that were deactivated.

How you can prepare for investigation

Now, the question is why, and we need you to investigate several things to find out. First of all - find out which DAGs disappeared (got deactivated).
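A minimal way to do that, assuming direct access to the metadata database and the standard Airflow 2.x `dag` table layout (the `is_active`, `is_paused`, `fileloc` and `last_parsed_time` columns exist in 2.9 - verify against your own schema; the syntax below is Postgres):

```sql
-- DAG ids the scheduler has deactivated because parsing stopped producing them:
SELECT dag_id, fileloc, last_parsed_time, is_paused
FROM dag
WHERE is_active = false
ORDER BY last_parsed_time;

-- For comparison, the currently active DAGs and when each was last parsed:
SELECT dag_id, fileloc, last_parsed_time
FROM dag
WHERE is_active = true
ORDER BY last_parsed_time;
```

Running the first query periodically while you watch the UI should tell you which dag ids flip to inactive and, via `fileloc`, whether they all come from a single file or a small group of files - that narrows down which of the causes below applies.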
Things that could be wrong

a) Your dynamic DAG generation is buggy / unstable and produces a different set of DAG ids every time it is parsed. Generally speaking, when you do dynamic DAG generation, every parse of a DAG file should consistently produce the same dag ids. Again - it is entirely up to you how the DAGs are written. There are different techniques you can use for dynamic DAG generation https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html#dynamic-dags-with-external-configuration-from-a-structured-data-file - and it might be that the way you do it is simply unstable. For example, your code pulls information about the DAGs to generate from a JSON file whose content changes in unpredictable / non-stable ways. Or there is a bug in the DAG generation code that sometimes raises an exception, so not all DAGs get generated. This is the most probable cause (a minimal sketch of a stable pattern follows right below).
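For illustration only - this is not your code, just a minimal sketch of the "external configuration from a structured data file" technique linked above. The key property is that the set of dag ids is fully determined by a file that lives next to the DAG file (and is therefore git-synced together with it), so every parse yields the same ids:

```python
# dags/generate_dags.py - hypothetical example of stable dynamic DAG generation.
# No network calls, no randomness, no time-dependent ids at parse time.
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.empty import EmptyOperator

CONFIG_FILE = Path(__file__).parent / "dag_configs.json"  # assumed layout

for config in json.loads(CONFIG_FILE.read_text()):
    dag_id = f"generated_{config['name']}"  # deterministic id per config entry
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),
        schedule=config.get("schedule"),
        catchup=False,
    ) as dag:
        EmptyOperator(task_id="placeholder")
    # Expose the DAG object at module level so the processor registers it.
    globals()[dag_id] = dag
```

If, instead, the id set depends on an API response, on data fetched over the network at parse time, or on anything that can intermittently fail or change, consecutive parses can legitimately produce different dag ids - and the "missing" ones get deactivated.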
b) Some of your files are not parsed for some reason. Generally, the Airflow DAG file processor parses all files continuously and should go through all of them in a loop, but there are certain parameters that control this - how many parsing processes run, how often files are re-parsed, and how long a single parse may take (see the config sketch after this list of causes). There are a few other parameters as well, but those are the important ones.

c) Your DAGs do not follow best practices. So one of the other options is that parsing some of your DAG files simply takes a long time - long enough that the DAG file processor needs more than 10 minutes to go through its loop and parse all your files. If you follow best practices https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#reducing-dag-complexity - and make sure that you do not block or spend a lot of time in the top-level code of your DAG files https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code - parsing even hundreds of files should take seconds at most. But there are cases - especially when your top-level code reaches out to external sources and possibly "hangs" or takes a long time to complete - where parsing can take arbitrarily long: minutes or even hours. If some of your DAG files do this, they might simply hang for a long time - with 2 parallel DAG parsing processes, it is enough to have two files whose parsing takes around 10 minutes for some of the remaining files not to be parsed "on time". Similarly, if you have many DAG files that each take a few minutes to parse, that might delay the queue enough that some of your DAGs will not be parsed within the default 10 minutes. You need to optimize your parsing time and follow best practices - ideally, all your DAGs should be parsed in seconds rather than minutes. Of course you can also play with the parameters above and make the timeouts bigger, but that masks the "long parsing" problem rather than solving it, and will result in, for example, far longer delays before DAG file changes are reflected in the parsed DAGs in the DB. You can control the DAG parsing timeout https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-to-control-dag-file-parsing-timeout-for-different-dag-files and the next section https://airflow.apache.org/docs/apache-airflow/stable/faq.html#when-there-are-a-lot-1000-of-dag-files-how-to-speed-up-parsing-of-new-files explains some of the ways you can attempt to speed up parsing. This is the second most probable cause of what you see (a before/after Python sketch follows this list as well).

d) Your syncing process might have some on/off states where files appear / disappear after check-out. Git-sync works by checking out the latest commit and then swapping it with the previous check-out via a symbolic link. Not very likely, though.

e) It might be that some combination of the parameters above and your synchronization settings causes bad sorting of the parsed files, so the processors end up not parsing some of your files - there might be various reasons for that. This is also quite an unlikely cause, but if you use non-POSIX-compliant shared filesystems, I can imagine it happening.

f) Finally - time on your various machines might not be synchronized. If one of the machines (DB, scheduler, processor, git) has a significant time drift, that might make various time calculations go wrong. Not very likely - most computing resources out there use NTP or a similar way of syncing time - but we have seen it happen.

Good luck with your investigation and please come back here with the results.
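Two sketches to make b) and c) concrete. First, the parsing-related settings typically worth reviewing - the option names are real Airflow 2.x settings, the values are the stock defaults to the best of my knowledge, so verify both against the configuration reference linked above:

```ini
# airflow.cfg - illustrative sketch of the parsing-related knobs
[core]
# Hard timeout for importing a single DAG file.
dagbag_import_timeout = 30.0
# Timeout for a full DagFileProcessor run on one file.
dag_file_processor_timeout = 50

[scheduler]
# How many DAG files are parsed in parallel.
parsing_processes = 2
# Minimum interval before the same file is re-parsed.
min_file_process_interval = 30
# How often the DAGs folder is re-listed for added/removed files.
dag_dir_list_interval = 300
# DAG ids not produced by parsing for this long get deactivated
# (the 10-minute window behind the log line you quoted).
dag_stale_not_seen_duration = 600
```

Second, a hypothetical before/after for c): the expensive work moves from parse time into the task callable, so parsing stays fast no matter how slow the external systems are:

```python
# BAD (hypothetical) - executed on every parse, so a slow secrets-manager or
# database call stalls the DAG file processor for this whole file:
#
#   secrets = fetch_secrets_from_vault()       # network call at parse time!
#   config_rows = query_the_warehouse()        # DB connection at parse time!

# BETTER - only cheap imports and definitions at module level; anything slow
# happens inside the task when it actually executes.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def do_work(**context):
    # This is where the Vault lookup / database connection belongs - at
    # execution time, inside the task, not in top-level DAG file code.
    print("fetch secrets and talk to the database here")


with DAG(
    dag_id="example_fast_parse",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="do_work", python_callable=do_work)
```

The same applies to module-level imports of heavy libraries - move them inside the callable if they are only needed at execution time.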
-
As explained in #39332 @vghar-bh - please follow the guide above and report the results of the investigation I outlined. That might help us improve the messaging and explanation of what can go wrong in a number of deployments - or maybe we will eventually find some problem in Airflow that causes it. I'd really appreciate a diligent reading and understanding of how things work and how they map to your deployment, because this is the only way we can get to the bottom of the disappearing-DAGs issue you have - it might help us build more resilience and better messaging, and possibly even find some bugs in Airflow, which would help other users with similar issues. Your help and diligence here are greatly appreciated.
-
BTW. I am planning to turn the explanation above into a "What to do if you see disappearing DAGs" FAQ on our website, so any feedback on that description is also most welcome.
-
Found the issue in our case. It was option "c". After some discussion with the data team, which is the one that creates the DAGs, it became clear that the best practices were not being followed. Besides large imports (which did not need to happen during the parse stage), there were some costly operations happening at parse time, like connecting to a vault to get secure parameters and also opening database connections. Multiplied by the number of affected DAGs, this clearly explained the observed behavior. Thanks again for the detailed post - it made it a lot easier to help them understand this was not an infrastructure issue.