-
I don't think this is in "all cases" - I think you have a specific problem that manifests this way, because the workers would usually use … I think you need to get deeper into seeing what your problem is (follow the `hostname_callable` lead). It's likely that you will find the root cause by doing that, and likely that it might result in a new feature, but I think more details and digging are needed on what your problem is. For now I am converting it into a discussion. We can always convert it back to an issue, or you will be able to open a new one. Also, I suggest trying whatever problem you have with the latest Airflow version, 2.4.2 - there have been some improvements in some parts of the hostname_callable handling that might give better diagnostics into what the real problem is.
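For anyone following the `hostname_callable` lead, this is roughly where the setting lives - a minimal sketch in Helm values form, assuming a recent chart/Airflow 2.x version (the default callable name may differ between releases):

```yaml
# values.yaml - sketch only; [core] hostname_callable can be overridden here.
# The callable decides which hostname the worker records for a task instance,
# and that hostname is what the webserver later uses to fetch the task log.
config:
  core:
    hostname_callable: "airflow.utils.net.getfqdn"  # assumed default; check your version's config reference
```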
-
The thing is, independently of how the worker registers itself in the hostname field of the database table, I cannot reach the worker log endpoint from the web server even using a …
We use EKS, maybe that's also relevant. If I create a pod worker manually, adding … I've seen that the other Airflow Helm projects fixed it by using a StatefulSet, so I really wonder how this is supposed to work currently.
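To make the failure concrete, here is a sketch of the check being described, assuming the official chart with release name `airflow`, namespace `airflow`, the chart's headless worker service, and the default worker log server port 8793 (all of these are assumptions - adjust to your install):

```bash
# From inside the webserver pod, try to reach a worker's served-logs endpoint
# by the per-pod DNS name that a headless service would normally provide.
kubectl -n airflow exec -it deploy/airflow-webserver -- \
  curl -sv "http://<worker-pod-name>.airflow-worker.airflow.svc.cluster.local:8793/"
# With workers running as a Deployment, no per-pod DNS record exists for
# <worker-pod-name>, so the lookup fails and the UI cannot fetch the log.
```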
-
Here is a reproducible example (tested with kind):
When trying to execute … (see the sketch of the reproduction below)
Again, I think this is because when worker log persistence is disabled, the Helm chart uses a Deployment instead of a StatefulSet. I believe there is no good reason for that, and if this is agreed on, I am happy to open a PR.
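A sketch of such a reproduction on a local kind cluster, assuming the official chart with the Celery executor and log persistence left disabled (its default); the release and namespace names are only examples:

```bash
# Throwaway cluster plus a default install of the official chart.
kind create cluster --name airflow-test
helm repo add apache-airflow https://airflow.apache.org
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set executor=CeleryExecutor

# Reach the UI, unpause and trigger any example DAG, then open its task log.
kubectl -n airflow port-forward svc/airflow-webserver 8080:8080
# The log tab reports that it cannot fetch the log from the worker, because the
# webserver cannot resolve the hostname the (Deployment-managed) worker recorded.
```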
-
I do not know why a Deployment was used instead of a StatefulSet, but I do agree that a StatefulSet is much better suited for this kind of workload with Celery workers. It just makes perfect sense to keep stable network identifiers even if persistence is not needed. The reasoning for this is not documented; it was implemented like that originally at Astronomer before they donated the chart to the community:
So currently I can only guess what the original reasoning was, same as anyone else in the community, but maybe @jedcunningham @dstandish could find something in the troves of Astronomer's history. Since you got it working locally, I suggest you open a PR @ebrard, and then maybe @dstandish / @jedcunningham will be faster to respond seeing a change coming.
-
Deployments are used with the Kubernetes event-driven autoscaler (KEDA) and without persistent volumes backing workers. For instance, on GCP it is non-trivial to get a ReadWriteMany PV available in the cluster for a small volume like the DAG volume, and it is not always desirable to place logs on a persistent volume when centralized logging is being used. See here: Line 506 in 221249e
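For context, a sketch of the kind of configuration this comment describes - KEDA-driven workers with no persistent volumes, relying on centralized/remote logging. The key names follow my reading of the official chart and may differ between chart versions:

```yaml
# values.yaml - workers autoscaled by KEDA, nothing stateful backing them.
workers:
  keda:
    enabled: true        # KEDA scales the worker Deployment up and down
  persistence:
    enabled: false       # no per-worker PV, so a StatefulSet is not required for storage
logs:
  persistence:
    enabled: false       # task logs are shipped to centralized/remote logging instead
```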
-
I had this problem for a while, and didn't want to switch to a StatefulSet. My SRE team tries to avoid stateful applications in our main Kubernetes cluster, instead using external persistence stores, and I don't know enough to contradict that or what the other implications of this change would be :D Here's the best solution I could come up with!

(Investigations I performed before reaching this solution)

I dug into the DNS entries being set up, and patched a `subdomain: airflow-triggerer` field onto the Deployment, but the hostnames never included the pod name, only the pod IP. After adding the subdomain, I was able to see the DNS entries via:

```bash
$ kubectl run ubuntu --image=ubuntu:latest -it --rm -- /bin/bash
root@ubuntu:/# apt update && apt install -y dnsutils
...
root@ubuntu:/# nslookup -type=SRV airflow-triggerer.airflow-kube-namespace.svc.cluster.local
...
airflow-triggerer.airflow-kube-namespace.svc.cluster.local service = 0 100 8794 10-2-3-4.airflow-triggerer.airflow-kube-namespace.svc.cluster.local.
```

(indicating a pod at IP `10.2.3.4`, but not including the pod's full name, and not matching the value of `hostname -A` in the triggerer pod itself)

Maybe some of this information will be helpful if someone wants to set up the DNS entries without switching to IPs. Ultimately, I saw that we can avoid the DNS lookup and just access the worker/triggerer pods directly via IP (following the hint above about hostname_callable), which solved my issue:

```yaml
# in the helm values.yaml
config:
  core:
    hostname_callable: "airflow.utils.net.get_host_ip_address"
```

Noting this down in case it's useful for anyone else in the future =)
-
Official Helm Chart version
1.7.0 (latest released)
Apache Airflow version
2.2.2
Kubernetes Version
1.23
Helm Chart configuration
No response
Docker Image customisations
No response
What happened
When a task runs on a worker, the web server cannot reach the worker to read the log from.
Since the workers are deployed as a Deployment and not as a StatefulSet, I suspect that a hostname and a subdomain should be set somehow in the worker pod template, which is currently not the case.
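A sketch of the pod-template fields being referred to; the names are illustrative only, and note that a plain Deployment gives every replica the same template, so it cannot hand out unique per-pod hostnames this way (a StatefulSet effectively sets both fields per pod for you):

```yaml
# Pod spec fields that make a per-pod DNS record possible (sketch only).
spec:
  hostname: airflow-worker-0   # illustrative; only a StatefulSet makes this unique per pod
  subdomain: airflow-worker    # must match the name of the headless Service
```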
What you think should happen instead
The web server should be able to reach any worker using the headless Kubernetes service which is created.
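What that would look like in practice - a sketch assuming release name `airflow`, namespace `airflow`, StatefulSet-style pod names, and the default worker log server port 8793 (all assumptions):

```bash
# Stable per-pod DNS through the headless service, as a StatefulSet provides:
nslookup airflow-worker-0.airflow-worker.airflow.svc.cluster.local
# ...and the webserver fetching served logs from that stable name; any HTTP
# response at all shows the worker's log server is reachable:
curl -s -o /dev/null -w "%{http_code}\n" \
  http://airflow-worker-0.airflow-worker.airflow.svc.cluster.local:8793/
```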
How to reproduce
Any deployment of the Helm chart should have this issue, I think.
Anything else
No response
Are you willing to submit PR?
Code of Conduct