Adding max_retries for ECSOperator #13725
-
Would it be valuable to add a max_retries param to the ECSOperator? In my use case, I have a DAG that uses the ECSOperator to start a Task, but sometimes my ECS cluster is at capacity and the ECSOperator fails. Instead, if the ECSOperator retried a couple of times (say 3 times over the period of 5 mins), it would give my ECS cluster time to scale out and accommodate the Task being created by the ECSOperator. What does everyone think? BTW I'm a first-time poster here, and am looking forward to contributing to this project.
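For illustration only, here is a rough sketch of what the proposal could look like at the call site. `max_retries` and `retry_wait` are not existing ECSOperator parameters; they are placeholder names assumed here for the proposed behaviour.

```python
from datetime import timedelta

# Hypothetical call-site sketch: `max_retries` and `retry_wait` do NOT exist on
# ECSOperator today; they stand in for the proposed behaviour of retrying
# RunTask ~3 times over ~5 minutes while the cluster scales out.
proposed_ecs_task_kwargs = dict(
    task_id="run_my_ecs_task",
    cluster="my-ecs-cluster",
    task_definition="my-task-def",
    launch_type="FARGATE",
    overrides={"containerOverrides": []},
    max_retries=3,                      # proposed: retry RunTask a few times
    retry_wait=timedelta(seconds=100),  # proposed: wait between attempts
)
# e.g. ECSOperator(**proposed_ecs_task_kwargs) once such parameters exist
```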
-
Sure. Go for it. We have similar retry mechanisms implemented in the "google" provider. This is especially useful if you have a way to distinguish a "capacity" (transient) error from any other "permanent" one. See `airflow/providers/google/common/hooks/base_google.py` (line 359 at 2f79fb9) for an example (this one is `quota_retry`, but there are a few others). We used tenacity to provide exponential back-off for such retries, and we recommend the same approach; this way you are even better at handling "big" spikes. We implemented it as decorators, so it could be applied to a wide range of Google operators (and some Google APIs have a built-in Retry capability, in which case we used the built-in ones). Maybe you could also work out a similar pattern for many Amazon operators?
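Below is a minimal sketch of how a tenacity-based retry decorator could look for the ECS case, loosely following the Google provider pattern. The predicate, the back-off values, and names such as `retry_on_ecs_capacity_error` are assumptions for illustration, not existing Airflow APIs.

```python
import tenacity


def _is_capacity_error(exc: BaseException) -> bool:
    # Assumption for this sketch: ECS RunTask capacity failures surface with a
    # "RESOURCE:*" reason (e.g. RESOURCE:MEMORY); treat those as transient and
    # everything else as permanent.
    return "RESOURCE" in str(exc)


def retry_on_ecs_capacity_error(func):
    """Retry the wrapped callable with exponential back-off on capacity errors."""
    return tenacity.retry(
        retry=tenacity.retry_if_exception(_is_capacity_error),
        # roughly spreads 3 attempts over a few minutes
        wait=tenacity.wait_exponential(multiplier=30, max=120),
        stop=tenacity.stop_after_attempt(3),
        reraise=True,
    )(func)


class EcsTaskStarter:
    """Stand-in for the part of the ECSOperator that calls RunTask."""

    @retry_on_ecs_capacity_error
    def start_task(self):
        # The real operator would call the boto3 ECS client's run_task() here
        # and raise if the response reports a capacity failure.
        ...
```

Pulling the back-off policy out into a decorator like this keeps it reusable across other Amazon operators, which is essentially the same design choice the Google provider made.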
-
Thanks for the quick (and helpful) response. I'll take a look at the existing pattern in the Google operators. If I have any questions, where should I post them? Thanks again!