Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update multi-node.qmd #1688
base: main
Are you sure you want to change the base?
Update multi-node.qmd #1688
Changes from all commits
e9ab5d8
81c91ab
93e71f0
640bb6c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge this is not the case. You need to do
accelerate launch -m
on every server else it will sit there and never actually startThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we tasted, we were only starting on single node (server) and it was able to use the resources from other nodes,
As a proof, we were able to see the ip of both the machines on the left, and in the total GPU it were showing all the GPU form all the nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@muellerzr My guess is this is probably specific to deepspeed since the IP addresses are set in a hostfile. We should probably disambiguate this that it only needs to be run on the first node when this is the case. Most other cases like FSDP or plain multinode DDP will likely still need
accelerate launch
to be run on each node.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@winglian, @muellerzr That might be the case , because for us deepspeed was a good options where we were using multi node for finetuning via EC2, as it provides the public ip , and we used hostfile, it was really easy to connect both machines and run the finetuning on root only, this indeed connected all the other instances. (using all the resources from all the nodes via single node)