[Feature]: Configuring the response to node failures #2123

Open
jvstme opened this issue Dec 19, 2024 · 0 comments

jvstme (Collaborator) commented Dec 19, 2024

Problem

If a node in a multi-node task fails, the remaining nodes are terminated and the entire run fails. This forces the user to restart the task, even though some frameworks are resilient to node failures and can redistribute the work among the remaining nodes.

Solution

Allow configuring dstack's response to node failures in the task configuration. The possible failure responses are:

  1. Terminate the remaining nodes (current behavior)
  2. Retry running the node
  3. Ignore the failure unless the failed node was the last remaining node in the run
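
The three responses above could be modeled roughly as follows. This is only a sketch to make the proposal concrete: `FailureResponse`, `Decision`, and `decide` are hypothetical names, not part of dstack, and it assumes that under response 3 the run fails once no nodes are left.

```python
from enum import Enum
from typing import NamedTuple


class FailureResponse(Enum):
    """Hypothetical names for the three proposed responses (not a dstack API)."""
    TERMINATE = 1  # terminate the remaining nodes (current behavior)
    RETRY = 2      # retry running the failed node
    IGNORE = 3     # ignore the failure unless it was the last node


class Decision(NamedTuple):
    terminate: frozenset  # nodes to terminate
    restart: frozenset    # nodes to start again
    run_failed: bool      # whether the whole run is marked as failed


def decide(response, failed_node, remaining_nodes):
    """Decide what to do after `failed_node` fails.

    `remaining_nodes` are the nodes still running, excluding `failed_node`.
    """
    remaining = frozenset(remaining_nodes)
    if response is FailureResponse.TERMINATE:
        return Decision(remaining, frozenset(), True)
    if response is FailureResponse.RETRY:
        return Decision(frozenset(), frozenset({failed_node}), False)
    if response is FailureResponse.IGNORE:
        # Assumption: the run fails only when no nodes are left running.
        return Decision(frozenset(), frozenset(), not remaining)
    raise ValueError(f"unknown failure response: {response!r}")
```

For example, `decide(FailureResponse.IGNORE, 1, {0, 2})` leaves nodes 0 and 2 running and does not fail the run, while the same call with an empty set of remaining nodes marks the run as failed.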

Additional information

Whether this should also apply to services is an open question. Services can already achieve failure responses 1 and 2 by omitting or specifying the retry policy, respectively. Failure response 3 is currently not achievable for services.

It is also an open question how to interpret the retry policy for different failure responses. Currently, retry policies work differently for tasks and services: tasks are terminated and retried completely with all the nodes, while services are retried on a per-replica basis.
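
To illustrate the difference in retry granularity described above, here is a small sketch. The function `members_to_resubmit` and the string values `"task"` and `"service"` are made up for illustration and do not reflect dstack internals.

```python
def members_to_resubmit(run_type, failed_member, all_members):
    """Return the nodes or replicas that a retry would resubmit.

    Hypothetical helper: tasks are retried as a whole with all their nodes,
    while services are retried one replica at a time.
    """
    if run_type == "task":
        # The whole task is terminated and retried with every node.
        return sorted(all_members)
    if run_type == "service":
        # Only the failed replica is retried.
        return [failed_member]
    raise ValueError(f"unknown run type: {run_type!r}")
```

With members `{0, 1, 2}` and a failure of member 1, a task retry would resubmit `[0, 1, 2]`, whereas a service retry would resubmit only `[1]`. Response 2 from the list above would move tasks closer to the per-replica behavior of services.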

Would you like to help us implement this feature by sending a PR?

Yes

jvstme added the feature label Dec 19, 2024