Job heartbeat retry with backoff #37378

awdavidson · 2024-02-13T09:16:15Z

Hi, we are currently running Airflow 2.7.3 with an externally managed Postgres instance and recently saw a job terminated due to:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

The issue seems to be transient, as rerunning the job succeeded. It may be a good idea to make a retry with backoff configurable in Job.heartbeat, ensuring that the total backoff time stays below Job.heartrate. Would anyone be against this functionality? If so, it would be good to understand why.
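A minimal sketch of the constraint described above, assuming exponential backoff: the delays are generated until adding the next one would reach the heartrate. The function name `backoff_delays` and its parameters (`base`, `factor`, `max_retries`) are hypothetical and are not part of Airflow.

```python
from typing import List


def backoff_delays(
    heartrate: float,
    base: float = 0.5,
    factor: float = 2.0,
    max_retries: int = 5,
) -> List[float]:
    """Return exponential backoff delays whose sum stays strictly below heartrate.

    All names here are illustrative assumptions, not Airflow APIs.
    """
    delays: List[float] = []
    total = 0.0
    delay = base
    for _ in range(max_retries):
        # Stop as soon as the next sleep would use up the whole heartbeat interval.
        if total + delay >= heartrate:
            break
        delays.append(delay)
        total += delay
        delay *= factor
    return delays


if __name__ == "__main__":
    # With a 5 second heartrate, only the first three delays fit in the budget.
    print(backoff_delays(5.0))
```

With a budget like this, a heartbeat that keeps failing gives up before the next heartbeat is due rather than overlapping with it.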

I am happy to contribute a change

Thanks,

Alfie

potiuk · 2024-02-13T11:51:02Z

Collaborator

We do have retries in specific DB calls when the call is easily recoverable. You can look at the Airflow code, where this is usually done with

    for attempt in run_with_db_retries(logger=self.log):
        with attempt:
            ...  # the DB call being retried

or simply methods decorated with:

    @retry_db_transaction

There is even a configuration parameter in Airflow (max_db_retries) which controls the number of retries.

This is done only for specific transactions: ones where, taking into account any in-memory side effects that processing the transaction leaves behind, we can safely assume that the whole transaction can be redone.

So absolutely, no problem: if you find a specific place and transaction that errored out and could be safely retried, and you can make the case for it in a PR, it's OK to have it. But it needs to be a specific transaction, and you have to justify that it is safe to retry. That means finding the particular place, reviewing the code to make sure it can be retried, and submitting a PR.
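To illustrate the "specific, safe-to-retry transaction" point, here is a standalone sketch using only the standard library. It does not use Airflow's own run_with_db_retries or @retry_db_transaction helpers; `retry_idempotent` and `TransientDBError` are hypothetical names, with the exception class standing in for sqlalchemy.exc.OperationalError. The operation passed in is assumed to have been reviewed and found safe to run more than once.

```python
import time
from typing import Callable, Iterable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientDBError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError."""


def retry_idempotent(
    operation: Callable[[], T],
    delays: Iterable[float],
    retry_on: Tuple[Type[BaseException], ...] = (TransientDBError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying after each delay when a retryable error is raised.

    The caller is responsible for passing only an operation that is safe to
    redo, i.e. one that leaves no partial state behind when it fails.
    """
    for delay in delays:
        try:
            return operation()
        except retry_on:
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return operation()


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_heartbeat_update() -> str:
        # Fails twice with a transient error, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientDBError("server closed the connection unexpectedly")
        return "ok"

    print(retry_idempotent(flaky_heartbeat_update, delays=[0.1, 0.2, 0.4]))
```

Errors outside `retry_on` are raised immediately, so only the failure mode that was reviewed gets retried.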
