[Feature]: Multi-node tasks with placement: any #2122

Open
jvstme opened this issue Dec 19, 2024 · 2 comments
Collaborator

jvstme commented Dec 19, 2024

Problem

Multi-node tasks can only run on fleets with placement: cluster (source code), which means all nodes must be in the same backend, region, and network.

Some distributed workloads don't require network connectivity between nodes. For example, worker nodes in a distributed data processing workload may fetch data from an external source and upload the processing results to the same source, without ever communicating with other worker nodes or even knowing that other nodes exist.
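To make this pattern concrete, here is a minimal sketch of such a worker. It assumes each node can read its own index and the total node count from DSTACK_NODE_RANK and DSTACK_NODES_NUM (whether these would be set for non-interconnected nodes is an assumption, not something this issue specifies). The select_shard helper and the object keys are hypothetical.

```python
import os


def select_shard(items, rank, total):
    """Return the items this worker should process (round-robin by index)."""
    if total < 1 or not 0 <= rank < total:
        raise ValueError(f"invalid rank {rank} for {total} workers")
    return [item for i, item in enumerate(items) if i % total == rank]


def main():
    # Assumption: the node index and node count are exposed via these variables.
    rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
    total = int(os.environ.get("DSTACK_NODES_NUM", "1"))
    # Hypothetical object keys; a real worker would list them from the external source.
    keys = [f"input/part-{i:04d}.parquet" for i in range(8)]
    for key in select_shard(keys, rank, total):
        # Fetching, processing, and uploading would happen here.
        # No other worker is contacted at any point.
        print(f"worker {rank} of {total} processing {key}")


if __name__ == "__main__":
    main()
```

Each worker derives its share of the input from its own index alone, so the nodes never need to reach each other over a private network.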

Currently, it is not possible to run such workloads on backends that don't support private networks (CUDO, DataCrunch, Lambda, RunPod, TensorDock, Vast.ai, Kubernetes) or to run them across backends and regions to optimize costs.

Solution

Allow specifying placement in task configurations. placement: cluster is the current behavior (nodes must be interconnected), while placement: any allows non-interconnected nodes across backends and regions.

The cluster-specific environment variables DSTACK_MASTER_NODE_IP and DSTACK_NODES_IPS are only available with placement: cluster.

The default is placement: cluster for backward compatibility.
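The proposed semantics could be sketched roughly as follows. This is an illustration of the rules above, not dstack's actual implementation: the Placement enum, resolve_placement, and cluster_env are hypothetical names, and treating the first node as the master and joining IPs with newlines are assumptions.

```python
from enum import Enum


class Placement(str, Enum):
    CLUSTER = "cluster"
    ANY = "any"


def resolve_placement(task_config):
    """Fall back to cluster when the task does not set placement."""
    return Placement(task_config.get("placement", Placement.CLUSTER.value))


def cluster_env(placement, node_ips):
    """Return the cluster-specific variables, or nothing for placement: any."""
    if placement is not Placement.CLUSTER:
        # Nodes may be in different backends or regions, so peer IPs are not exposed.
        return {}
    return {
        # Assumption: the first node acts as the master node.
        "DSTACK_MASTER_NODE_IP": node_ips[0],
        # Assumption: IPs are joined with newlines.
        "DSTACK_NODES_IPS": "\n".join(node_ips),
    }
```

An existing task configuration without a placement key resolves to cluster, so its behavior would not change.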

Workaround

Submit multiple single-node runs instead of one multi-node task.
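A rough sketch of how the workaround might be scripted: generate one single-node task configuration per worker and submit each as a separate run. The single_node_runs helper and the SHARD_INDEX and SHARD_COUNT variables are hypothetical, and the dictionaries only approximate the shape of a task configuration.

```python
def single_node_runs(base_name, commands, count):
    """Build one single-node task configuration per worker."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [
        {
            "type": "task",
            "name": f"{base_name}-{index}",
            "commands": list(commands),
            # Hypothetical variables that tell each run which share of the work it owns.
            "env": [f"SHARD_INDEX={index}", f"SHARD_COUNT={count}"],
        }
        for index in range(count)
    ]


if __name__ == "__main__":
    for config in single_node_runs("process-data", ["python worker.py"], 3):
        print(config["name"], config["env"])
```

The drawback is that each run has to be submitted, monitored, and stopped separately, which is the overhead a single multi-node task would avoid.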

Would you like to help us implement this feature by sending a PR?

Yes

@jvstme jvstme added the feature label Dec 19, 2024

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Jan 19, 2025

github-actions bot commented Feb 3, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

@github-actions github-actions bot closed this as not planned Feb 3, 2025
@jvstme jvstme reopened this Feb 3, 2025
@github-actions github-actions bot removed the stale label Feb 4, 2025