Executor reports task instance <TaskInstance: XXX [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally #42966
Replies: 12 comments 4 replies
-
That 10 minutes gap could be because of this config: airflow/airflow/config_templates/config.yml Lines 2600 to 2606 in 6229c1c |
Beta Was this translation helpful? Give feedback.
-
Maybe, it makes some sense as we do not have that setting in the configuration. So, I believe that the real question is, why is the task queued but never started? All timeouts on our config: |
Beta Was this translation helpful? Give feedback.
-
It's the same case for me when the incident happened the other day. Workers are all online, but no task gets executed. I notice something abnormal in scheduler log when this happens: all tasks in a dag was repeatedly queued for thousands of times in one second. Looks like scheduler gets into a strange state |
Beta Was this translation helpful? Give feedback.
-
There is something messed up.
Today I had a filesystem sensor that failed by timeout (airflow task) but
the celery task was successful ...
A terça, 6/08/2024, 20:15, scaoupgrade ***@***.***> escreveu:
… Maybe, it makes some sense as we do not have that setting in the
configuration. But, what is bothering me is WHY? why did it timeout after
being queued? We have no exhaustion anywhere, not in CPU, Memory, pool
slots, concurrency. I'd say that at the current time we have a very light
system...
So, I believe that the real question is, why is the task queued but never
started?
+-1year with some cleanups [image: image]
<https://private-user-images.githubusercontent.com/16724800/355507694-6aa42a5d-8980-4f8f-b31b-3202f99f276f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjI5NzE4NDksIm5iZiI6MTcyMjk3MTU0OSwicGF0aCI6Ii8xNjcyNDgwMC8zNTU1MDc2OTQtNmFhNDJhNWQtODk4MC00ZjhmLWIzMWItMzIwMmY5OWYyNzZmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODA2VDE5MTIyOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTI2NjYzZDk2YzNkZmNhNjNjNzM3YWM0MDczMmY0Nzk4MDI2ZTM5OTQyYjEzNjU3ODZlMGU1ZDFmYTA0MTZhMzgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.KROI5i_Ta7uAa9f4BEntuF3Ibz3TbkhZ5FPdi1-ZkGk>
All timeouts on our config: dagbag_import_timeout = 120.0
dag_file_processor_timeout = 180 default_task_execution_timeout =
web_server_master_timeout = 120 web_server_worker_timeout = 120
smtp_timeout = 30 operation_timeout = 2 stalled_task_timeout = 0
default_timeout = 604800
It's the same case for me when the incident happened the other day.
Workers are all online, but no task gets executed. I notice something
abnormal in scheduler log when this happens: all tasks in a dag was
repeatedly queued for thousands of times in one second. Looks like
scheduler gets into a strange state
—
Reply to this email directly, view it on GitHub
<#41276 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AD7TGQF6SPFC3OJWACXLBN3ZQEOEPAVCNFSM6AAAAABMBY37NCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZRHE3TGMBSGI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
I had the same, we noticed it was very aligned with Redis instabilities. after we migrated to saas redis, we did not see it anymore. Also allowing keda to auto scaling to 0 naturally restarted workers. mini scheduler was also disabled, so I'm unsure what fixed. |
Beta Was this translation helpful? Give feedback.
-
I use Amazon Managed Workflow Apache Airflow 2.9.2, met the issue when execute a simple dag, the Task log is: the schedule log as: |
Beta Was this translation helpful? Give feedback.
-
Please check this property AIRFLOW__CELERY__OPERATION_TIMEOUT. In our environment we were getting this airflowtasktimeout error. We had to increase the time from 1.0 sec . |
Beta Was this translation helpful? Give feedback.
-
thanks very much, it works. |
Beta Was this translation helpful? Give feedback.
-
Converting to a discussion - seems that there is a working solution. If there is anything else to do - new issue with concrete case could be created. |
Beta Was this translation helpful? Give feedback.
-
Im having a simular issue and wondering where and how to set the mentioned time out. The number of seconds to wait before timing out
|
Beta Was this translation helpful? Give feedback.
-
AIRFLOW__CELERY__OPERATION_TIMEOUT = 30 is solving most this problem |
Beta Was this translation helpful? Give feedback.
-
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.9.2
What happened?
Case 2 of issue #39717
Task fails with the message:
Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally.
Despite that the Airflow task has an external_executor_id, it is not registered in flower ( celery)
Dagrun Running | Task instance's dagrun was not in the 'running' state but in the state 'failed'. -- | -- Task Instance State | Task is in the 'failed' state.
external_executor_id a01b358f-ad34-4b16-81b9-fd69218f613e does not exist in flower / celery
look at the timestamps in the logs:
there is a gap of 10minutes betwwen the Start ( dummy task) and the dispatch_restores task.
And this behaviour is recurrent, (the 10m gap)
and in the task log:
'attempt=1.log.SchedulerJob.log'
(tkapp) ttauto@slautop02$ cat attempt=1.log.SchedulerJob.log
[2024-08-05T21:02:15.585+0100] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-08-05T21:02:16.750+0100] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
What you think should happen instead?
The external task should exist in celery and the reason for the task to be marked as failed should be explicit.
There is no other info that justifies the task to fail.
How to reproduce
I just happens
Operating System
NAME="Red Hat Enterprise Linux" VERSION="8.9 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.9" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.9 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
Versions of Apache Airflow Providers
apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1
Deployment
Other
Deployment details
Just simple airflow install on linux box.
Using PostgresQLm Rabbitmq and celery executor
Anything else?
randomly
Are you willing to submit PR?
Code of Conduct
Beta Was this translation helpful? Give feedback.
All reactions