The executor reported that the task instance <TaskInstance: ... [queued]> finished with state success #42991
-
Yes, quite similar to a few other discussions:
I am also observing tasks being marked as SUCCESS then immediately marked as FAILED. Something that I believe to be different in this case compared to the others linked is my tasks are completing very quickly (<1s) which I wonder if that's contributing? Specifically in this case I a have a DAG that builds some data since What I am observing is a number of the backfill DagRuns failing all with the same behaviour; event log shows task marked success and then failed very soon after, and the infamous log line mentioned in the title. Here's a snippet of collected logs: lines starting with
The task simply starts an Athena query using My cluster right now only has 5x DAGs total, each with 1-3 tasks each. So a very small deployment (1x Fargate task each for scheduler, worker, web, triggerer, flower, db.t4g.medium RDS postgres for metadb, cache.t2.micro for broker) and I don't believe I am hitting any of the timeouts mentioned in other threads (Celery operation timeout, dagbag import/process, etc.). |
Beta Was this translation helpful? Give feedback.
Replies: 13 comments 1 reply
-
You can see it sometimes happen for the first operator too; it's an "allowed lateness" operator (#42636) which also succeeds very quickly as the lateness cutoff has already elapsed (i.e. this |
Beta Was this translation helpful? Give feedback.
-
Weirdly I do not observe this in |
Beta Was this translation helpful? Give feedback.
-
Also forgot to mention we are on Airflow 2.10.0. |
Beta Was this translation helpful? Give feedback.
-
I wonder if https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-78+Scheduler-managed+backfill will fix? Potentially a bug in the backfill CLI scheduler that does not exist in the "real" scheduler? (Based on my observation that this seems to only happen for my backfill jobs). |
Beta Was this translation helpful? Give feedback.
-
I found two copies of the same Airflow task in Celery Flower UI with different UUIDS, so looks like I am somehow running the task twice. |
Beta Was this translation helpful? Give feedback.
-
event log |
Beta Was this translation helpful? Give feedback.
-
Suspicious log
|
Beta Was this translation helpful? Give feedback.
-
Ahhh right, the reason for the second Celery message is because we are using deferred operator. First message is the task starting, second message is task resuming after trigger event. |
Beta Was this translation helpful? Give feedback.
-
Ok! Testing out #43063 on my cluster I am seeing the log line I added get exercised and it is not followed by the infamous
|
Beta Was this translation helpful? Give feedback.
-
still present in airflow 2.10.3 ( wtihout backfill , but manual trigger )
|
Beta Was this translation helpful? Give feedback.
-
I actually learned something about this when working on this one #43520 I found that if anything (i.e. at some place in scheduler) calls executor.change_state or executor.fail, this message is likely to be produced. there are a couple places where it's done. this is because this produces an "executor event" but the scheduler sort of assumes (as indicated by that error message) that executor events come only when task has finished (one way or another) on the worker. and when that's not the case, it's good chance there's a mismatch. oh -- update -- you are looking 'at backfill |
Beta Was this translation helpful? Give feedback.
-
nice work @antonysouthworth-halter |
Beta Was this translation helpful? Give feedback.
Ok! Testing out #43063 on my cluster I am seeing the log line I added get exercised and it is not followed by the infamous
The executor reported that the task instance finished with state success, but the task instance's state attribute is queued
message.