[Feature]: Configuring the response to node failures #2123

jvstme · 2024-12-19T12:32:19Z

Problem

If a node in a multi-node task fails, the remaining nodes are terminated and the entire run fails. This forces the user to restart the task, even though some frameworks are resilient to node failures and can redistribute the work among the remaining nodes.

Solution

Allow configuring dstack's response to node failures in the task configuration. The possible failure responses are:

1. Terminate the remaining nodes (current behavior)
2. Retry running the node
3. Ignore the failure unless it was the last node in the run
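
To illustrate, the three responses could be modeled roughly as below. This is only a sketch of the intended semantics: `NodeFailureResponse`, `on_node_failure`, and the returned action strings are hypothetical names, not part of dstack's API or configuration schema.

```python
from enum import Enum


class NodeFailureResponse(Enum):
    """Hypothetical names for the three proposed failure responses."""

    TERMINATE = "terminate"  # 1. terminate the remaining nodes (current behavior)
    RETRY = "retry"          # 2. retry running the failed node
    IGNORE = "ignore"        # 3. ignore the failure unless it was the last node


def on_node_failure(response: NodeFailureResponse, remaining_nodes: int) -> str:
    """Return the (hypothetical) action to take after one node fails.

    `remaining_nodes` is the number of nodes still running after the failure.
    """
    if response is NodeFailureResponse.TERMINATE:
        return "terminate_run"
    if response is NodeFailureResponse.RETRY:
        return "retry_node"
    # IGNORE: the run continues as long as at least one node is still running.
    return "continue" if remaining_nodes > 0 else "fail_run"
```

For example, with the ignore response, a failure in a 4-node run leaves 3 nodes working and the run continues; only when the last node fails does the run fail.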

Additional information

Whether this applies to services is an open question. Services can already achieve failure responses 1 and 2 by omitting or specifying the retry policy, respectively. Failure response 3 is currently not achievable for services.

It is also an open question how to interpret the retry policy for different failure responses. Currently, retry policies work differently for tasks and services: tasks are terminated and retried completely with all the nodes, while services are retried on a per-replica basis.
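
The current difference in retry scope can be sketched as follows. The function `retry_scope` and its arguments are assumptions made for illustration and do not reflect dstack internals.

```python
def retry_scope(run_type: str, failed: str, members: list[str]) -> list[str]:
    """Return which nodes or replicas are restarted when `failed` fails.

    Assumes a retry policy is configured. `members` lists all nodes
    (for a task) or all replicas (for a service) in the run.
    """
    if run_type == "task":
        # Tasks: the whole run is terminated and retried with all the nodes.
        return list(members)
    if run_type == "service":
        # Services: only the failed replica is retried.
        return [failed]
    raise ValueError(f"unknown run type: {run_type}")
```

Under failure response 2, a task would presumably need the per-node scope that services already use, which is why the interpretation of the retry policy is part of the open question.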

Would you like to help us implement this feature by sending a PR?

Yes

jvstme added the feature label Dec 19, 2024
