Problem
If a node in a multi-node task fails, the remaining nodes are terminated and the entire run fails. This forces the user to restart the task, even though some frameworks are resilient to node failures and can redistribute the work among the remaining nodes.
Solution
Allow configuring dstack's response to node failures in the task configuration. The possible failure responses are:
1. Terminate the remaining nodes (current behavior)
2. Retry running the node
3. Ignore the failure unless it was the last node in the run
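To make the three responses concrete, here is a minimal Python sketch of how a per-node failure decision could look. None of these names exist in dstack: `NodeFailurePolicy`, `on_node_failure`, and the returned action strings are hypothetical. It also assumes that "the last node in the run" means the last node still running, which the proposal does not spell out.

```python
from enum import Enum


class NodeFailurePolicy(Enum):
    """Hypothetical policy names; not part of dstack's configuration."""
    TERMINATE = "terminate"  # response 1: current behavior
    RETRY = "retry"          # response 2: retry only the failed node
    IGNORE = "ignore"        # response 3: keep going on the survivors


def on_node_failure(policy, failed_node, alive_nodes):
    """Decide what to do when `failed_node` fails.

    `alive_nodes` lists the nodes still running, excluding the failed one.
    Returns a tuple (action, affected_nodes).
    """
    if policy is NodeFailurePolicy.TERMINATE:
        # Stop every remaining node; the whole run fails.
        return ("terminate_run", list(alive_nodes))
    if policy is NodeFailurePolicy.RETRY:
        # Resubmit only the failed node; the others keep running.
        return ("retry_node", [failed_node])
    if policy is NodeFailurePolicy.IGNORE:
        # Assumption: the run fails only once no node is left running.
        if alive_nodes:
            return ("continue", [])
        return ("fail_run", [])
    raise ValueError(f"unknown policy: {policy!r}")
```

For example, with the ignore policy a run of three nodes would survive the loss of node 2 (`on_node_failure(NodeFailurePolicy.IGNORE, 2, [0, 1])` yields `("continue", [])`) and would fail only when the final node goes down.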
Additional information
Whether this applies to services is an open question. Services can already achieve failure responses 1 and 2 by omitting or specifying the retry policy. Failure response 3 is not achievable.
It is also an open question how to interpret the retry policy for different failure responses. Currently, retry policies work differently for tasks and services: tasks are terminated and retried completely with all the nodes, while services are retried on a per-replica basis.
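The difference in retry scope described above can be sketched as follows. This is only an illustration of the stated behavior, not dstack's actual code: `jobs_to_resubmit` and the `"task"` / `"service"` strings are hypothetical names chosen for the example.

```python
def jobs_to_resubmit(run_type, all_jobs, failed_job):
    """Return the jobs that would be resubmitted after `failed_job` fails.

    Illustrative only: mirrors the behavior described in this issue.
    """
    if run_type == "task":
        # Tasks: the whole run is terminated and every node is retried.
        return list(all_jobs)
    if run_type == "service":
        # Services: only the failed replica is retried.
        return [failed_job]
    raise ValueError(f"unknown run type: {run_type!r}")
```

Any per-node failure response for tasks would have to reconcile these two scopes, for instance by deciding whether a `retry` policy on a task with failure response 2 resubmits one node or all of them.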
Would you like to help us implement this feature by sending a PR?
Yes