Problem

Multi-node tasks can only run on fleets with placement: cluster (source code), which means all nodes must be in the same backend, region, and network.
Some distributed workloads don't require network connectivity between nodes. For example, worker nodes in a distributed data processing workload may fetch data from an external source and upload the processing results to the same source, without ever communicating with other worker nodes or even knowing that other nodes exist.
Currently, it is not possible to run such workloads on backends that don't support private networks (CUDO, DataCrunch, Lambda, RunPod, TensorDock, Vast.ai, Kubernetes) or to run them across backends and regions to optimize costs.
Solution
Allow specifying placement in task configurations. placement: cluster keeps the current behavior (nodes must be interconnected), while placement: any allows non-interconnected nodes across backends and regions.
The cluster-specific environment variables DSTACK_MASTER_NODE_IP and DSTACK_NODES_IPS are only available with placement: cluster.
The default is placement: cluster for backward compatibility.
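A minimal sketch of what a task configuration could look like under this proposal. The top-level placement field is the proposed addition, and the run name, script, data source, and resource spec are made up for illustration:

```yaml
type: task
name: shard-processing    # hypothetical run name
nodes: 8                  # eight independent workers
placement: any            # proposed field: nodes may land on different backends/regions

env:
  - DATA_SOURCE=s3://example-bucket/input  # hypothetical external data source

commands:
  - python process_shard.py  # each node fetches its input and uploads results on its own

resources:
  gpu: 24GB
```

With placement: any, DSTACK_MASTER_NODE_IP and DSTACK_NODES_IPS would not be set, since the nodes share no private network.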
Workaround
Multiple single-node runs.
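For example (a sketch, with a hypothetical process_shard.py script and SHARD_ID variable), each worker can be submitted as its own single-node run, varying the shard per run:

```yaml
type: task
name: worker-0            # submit one such run per worker (worker-1, worker-2, ...)
nodes: 1                  # single node, so no cluster placement is required

env:
  - SHARD_ID=0            # change per run so each worker processes a different shard

commands:
  - python process_shard.py --shard $SHARD_ID

resources:
  gpu: 24GB
```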
Would you like to help us implement this feature by sending a PR?
Yes