Worker logs not reachable from webserver while running #27558

ebrard · 2022-11-08T14:14:52Z

ebrard
Nov 8, 2022

Official Helm Chart version

1.7.0 (latest released)

Apache Airflow version

2.2.2

Kubernetes Version

1.23

Helm Chart configuration

No response

Docker Image customisations

No response

What happened

When a task runs on a worker, the web server cannot reach the worker to read the log from.

Considering the workers are deployed as a deployment and not as a statefulset, I suspect that a hostname and a subdomain should be used somehow in the worker template, which is currently not the case.

What you think should happen instead

The web server is able to reach any worker using the headless kubernetes service which is created.

How to reproduce

Any deployment of the helm should have this issue I think.

Anything else

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

potiuk · 2022-11-08T15:20:23Z

potiuk
Nov 8, 2022
Collaborator

I don't think this is in "all cases" - I think you have a specific problem that manifests this way - because the workers would usually use hostname_callable (https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#hostname-callable) to get their internal name. I suspect what you observe might be the in-famous instabilito of DNS in k8s (which happens quite often adn in some cases it might be even completely broken).

I thnk you need to get deeper into seeing what is your problem (follow the hostname_callable lead). It's likely that you will find the root cause by doing it and likely that it might result in a new feature, but I think more details and digging is need on what your problem is.

For now I am converting it into a discussion. We can always convert it back to issue or you will be able to open a new one.

Also I suggest to try whatever problem you have with latest airflow version 2.4.2 - there have been some improvements in som parts of the hostname_callable handling that might give better diagnostics into what the real problem is.

0 replies

ebrard · 2022-11-08T15:28:50Z

ebrard
Nov 8, 2022
Author

@potiuk

The thing is, independently of how the worker registers itself in the hostname field of database table, I cannot reach the worker log endpoint from the web server even using a curl command. We had a similar issue in the past and we had to explicitly set the hostname and the subdomain spec field in the pod definition, see this kubernetes documentation page:

Because A or AAAA records are not created for Pod names, hostname is required for the Pod's A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless Service (default-subdomain.my-namespace.svc.cluster-domain.example), pointing to the Pod's IP address. Also, Pod needs to become ready in order to have a record unless publishNotReadyAddresses=True is set on the Service.

We use EKS, maybe that's also relevant.

If I create a pod worker manually adding hostname and subdomain in the spec, then I can reach it (but obviously it's not part of the worker deployment anymore), so I believe my question still holds.

I've seen that the other airflow helm projects fixed it by using a statefulset so I really wonder how this is supposed to work currently.

4 replies

potiuk Nov 8, 2022
Collaborator

@jedcunningham @dstandish or @ephraimbuddy - maybe you can comment on that :)

dstandish Nov 8, 2022
Collaborator

@ebrard have you tried reproducing locally? e.g. helm installing in docker desktop k8s

ebrard Nov 10, 2022
Author

Yes, as I've mentioned this is not the first time that we encounter this with kubernetes but we moved from our in-house solution to using the helm based deploymentss.

ebrard Nov 10, 2022
Author

Ok, I've checked the helm template and see you do that:

kind: {{ if $persistence }}StatefulSet{{ else }}Deployment{{ end }}

I think there is no reason to use a deployment in any case and this should be simplify to always use a StatefulSet. StatefulSet always creates fully resolvable pods.

@dstandish what's your opinion here? I really don't see any reason to use a deployment.

ebrard · 2022-11-10T11:57:44Z

ebrard
Nov 10, 2022
Author

Here is a reproductible example (tested with kind):

env:
  - name: "environment"
    value: "local"
dags:
  persistence:
    enabled: false
logs:
  persistence:
    enabled: false
workers:
  safeToEvict: false
  replicas: 1
  persistence:
    enabled: false
extraEnv: |-
  - name: AIRFLOW__CORE__LOAD_EXAMPLES
    value: "True"

When trying to execute example_bash_operator and trying to get the logs from task runme_0 , the following error shows up:

*** Log file does not exist: /opt/airflow/logs/dag_id=example_bash_operator/run_id=scheduled__2022-11-09T00:00:00+00:00/task_id=runme_0/attempt=1.log
*** Fetching from: http://airflow-worker-549bcd7d9b-wb4fb:8793/log/dag_id=example_bash_operator/run_id=scheduled__2022-11-09T00:00:00+00:00/task_id=runme_0/attempt=1.log
*** Failed to fetch log file from worker. [Errno -2] Name or service not known

Again, I think this is because when worker logs persistence is disabled, the helm chart uses a deployment instead of a statefulset. I believe there is no good reason for that, and if this is agreed on, I am happy to open a PR.

0 replies

potiuk · 2022-12-04T17:29:45Z

potiuk
Dec 4, 2022
Collaborator

I do not know why Deployment was used instead of StatefulSet, but I do agree, StatefulSet is much better suited for the kind of workload with celery workers. It just makes perfect sense to keep stable network identifiers even if persistence is not needed.

The reasoning for this is not documented, It has been implemented like that originally in Astronomer when they donated the chart to the communiy:
66e7382

kind: {{ if $persistence }}StatefulSet{{ else }}Deployment{{ end }}

So currently I can only guess what could be the original reasoning was. Same as anyone else in the community, but maybe @jedcunningham @dstandish could find something in the throves of Astronomer's history.

Since you got it working locally I suggest you open PR @ebrard and maybe then @dstandish / @jedcunningham and they might be faster to respond seeing a change coming.

0 replies

brokenjacobs · 2023-04-18T18:09:30Z

brokenjacobs
Apr 18, 2023

Deployments are used with the kubernetes event driven autoscaler (KEDA) and without persistent volumes backing workers. For instance on GCP it is non-trivial to get a readWriteMany PV available in the cluster for a small volume like the dag volume. And not always desireable to place logs on a persistent volume when centralized logging is being used.

See here:

airflow/chart/values.yaml

Line 506 in 221249e

# Persistence.enabled must be set to false to use KEDA.

1 reply

brokenjacobs May 5, 2023

Ah I see the point, we could use statefulset even without a PVC being bound (no persistence). Yes this seems like a good way to go.

Throne3d · 2024-12-18T17:32:13Z

Throne3d
Dec 18, 2024

I had this problem for a while, and didn't want to switch to StatefulSet. My SRE team tries to avoid stateful applications in our main Kubernetes cluster, instead using external persistence stores, and I don't know enough to contradict it or what the other implications of this change would be :D Here's the best solution I could come up with!

(Investigations I performed before reaching this solution)

I dug into the DNSes being set up, and patched in a `subdomain: airflow-triggerer` field on the Deployment, but the hostnames never included the pod name, only the pod IP. After adding the subdomain, I was able to see the DNS entries via: ```bash $ kubectl run ubuntu --image=ubuntu:latest --it --rm -- /bin/bash root@ubuntu:/# apt update && apt install -y dnsutils ... root@ubuntu:/# nslookup -type=SRV airflow-triggerer.airflow-kube-namespace.svc.cluster.local ... airflow-triggerer.airflow-kube-namespace.svc.cluster.local service = 0 100 8794 10-2-3-4.airflow-triggerer.airflow-kube-namespace.svc.cluster.local. ``` (indicating a pod at IP `10.2.3.4`, but not including the pod's full name, and not matching the value of `hostname -A` in the triggerer pod itself)

Maybe some of this information will be helpful if someone wants to set up the DNS entries without switching to IP.

Ultimately, I saw that we can avoid DNS lookup and just access the worker/triggerer pods directly via IP (following the hint above about hostname_callable), which solved my issue:

# in the helm values.yaml
config:
  core:
    hostname_callable: "airflow.utils.net.get_host_ip_address"

Noting this down in case it's useful for anyone else in the future =)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Worker logs not reachable from webserver while running #27558

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 5 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Worker logs not reachable from webserver while running #27558

ebrard Nov 8, 2022

Official Helm Chart version

Apache Airflow version

Kubernetes Version

Helm Chart configuration

Docker Image customisations

What happened

What you think should happen instead

How to reproduce

Anything else

Are you willing to submit PR?

Code of Conduct

Replies: 6 comments · 5 replies

potiuk Nov 8, 2022 Collaborator

ebrard Nov 8, 2022 Author

potiuk Nov 8, 2022 Collaborator

dstandish Nov 8, 2022 Collaborator

ebrard Nov 10, 2022 Author

ebrard Nov 10, 2022 Author

ebrard Nov 10, 2022 Author

potiuk Dec 4, 2022 Collaborator

brokenjacobs Apr 18, 2023

brokenjacobs May 5, 2023

Throne3d Dec 18, 2024

ebrard
Nov 8, 2022

Replies: 6 comments 5 replies

potiuk
Nov 8, 2022
Collaborator

ebrard
Nov 8, 2022
Author

potiuk Nov 8, 2022
Collaborator

dstandish Nov 8, 2022
Collaborator

ebrard Nov 10, 2022
Author

ebrard Nov 10, 2022
Author

ebrard
Nov 10, 2022
Author

potiuk
Dec 4, 2022
Collaborator

brokenjacobs
Apr 18, 2023

Throne3d
Dec 18, 2024