Can nccl dynamically add/remove GPU workers? #1543

Open
mzz12 opened this issue Dec 15, 2024 · 5 comments

Comments

@mzz12

mzz12 commented Dec 15, 2024

Hi,

I would like to know whether NCCL supports dynamically adding or removing GPU workers.

BR

@wangfakang

+1 friendly ping @sjeaugey

@Jazel-Z

Jazel-Z commented Dec 30, 2024

Do you mean something like ncclCommSplit, or do you need a more flexible interface?

@sjeaugey
Member

sjeaugey commented Jan 6, 2025

Right now, the way to add/remove GPUs is to create a new group with more/fewer ranks. ncclCommSplit can be used for that, or simply ncclCommInitRank.
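To make the ncclCommSplit route concrete, here is a minimal sketch of how each rank could pick its `color` and `key` arguments when shrinking a communicator. The helper names (`plan_shrink`, `expected_new_rank`), the `split_args` struct, and the `SKETCH_NOCOLOR` constant are hypothetical and only stand in for this example; real code would include `<nccl.h>` and use `NCCL_SPLIT_NOCOLOR`. The NCCL call itself appears only in a comment, so the block compiles without NCCL or a GPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: stand-in for NCCL_SPLIT_NOCOLOR from <nccl.h>.
 * Real code should use the NCCL constant, not this one. */
#define SKETCH_NOCOLOR (-1)

/* Hypothetical container for the two integers a rank passes to ncclCommSplit. */
typedef struct {
    int color; /* ranks passing the same color end up in the same new communicator */
    int key;   /* smaller key means smaller rank in the new communicator */
} split_args;

/* Decide what old_rank passes to ncclCommSplit when the group shrinks.
 * keep[i] is true if old rank i stays in the new communicator. */
static split_args plan_shrink(int old_rank, const bool *keep, int n_ranks) {
    assert(old_rank >= 0 && old_rank < n_ranks);
    split_args a;
    if (!keep[old_rank]) {
        /* A removed rank still takes part in the split call, but opts out. */
        a.color = SKETCH_NOCOLOR;
        a.key = 0;
        return a;
    }
    a.color = 0;       /* all survivors share one color */
    a.key = old_rank;  /* preserve the original relative order */
    return a;
}

/* Rank a survivor would receive in the new communicator under the plan
 * above: the number of kept ranks below it. Returns -1 for removed ranks. */
static int expected_new_rank(int old_rank, const bool *keep, int n_ranks) {
    assert(old_rank >= 0 && old_rank < n_ranks);
    if (!keep[old_rank]) return -1;
    int r = 0;
    for (int i = 0; i < old_rank; i++) {
        if (keep[i]) r++;
    }
    return r;
}

/* In a real program, every rank of the parent communicator would then call:
 *
 *   split_args a = plan_shrink(my_rank, keep, n_ranks);
 *   ncclComm_t newcomm;
 *   ncclCommSplit(comm, a.color, a.key, &newcomm, NULL);
 *
 * Ranks that opted out get newcomm == NULL. Passing NULL as the config
 * makes the new communicator inherit the parent's configuration. */
```

ncclCommSplit can only carve subsets out of an existing communicator, so this covers removal. Adding GPUs that were never part of the parent needs the ncclCommInitRank route instead: generate a fresh ID with ncclGetUniqueId, distribute it out of band, and have every rank (old and new) call ncclCommInitRank with the new rank count.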

@andakai

andakai commented Jan 7, 2025

Will NCCL support this dynamically (without creating a new group) in the future?

@sjeaugey
Member

sjeaugey commented Jan 7, 2025

It's something we've been thinking about, but it's a pretty complex thing to do, and whether it would be significantly faster than re-creating a communicator is unclear.

It's not as simple as it seems: many things are pre-computed assuming a given set of GPUs, and adding/removing ranks would require recomputing a lot of things ... maybe almost everything.
