Using accelerate launch to initialize sagemaker job doesn't work properly with multiple GPUs #3148
Comments
If |
Yes, I'd recommend invoking inside of sagemaker instead in this case. (Though MPI should only be run on CPU, not GPU)
Sorry if I wasn't clear in my original report. This is more of a complaint about the default behavior of
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
! Please note that the system info above does not reflect the actual environment accelerate runs in on Sagemaker. The above config is generated in a Sagemaker official container.
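Since the system info above was not captured in the actual training environment, one way to see what the job really runs with is to print the relevant environment variables from inside the training script. The following is a minimal sketch, not part of the original report. The helper name collect_runtime_env is hypothetical, and the list of variables is an assumption: the SM_* names are ones the SageMaker training toolkit typically sets, the others are typically set by torch.distributed launchers, and any of them may be absent in a given container.

```python
import json
import os

# Variables that commonly matter for multi-GPU launches (assumed list).
# SM_* are typically set by the SageMaker training toolkit; the others
# are typically set by torch.distributed launchers.
KEYS = (
    "SM_NUM_GPUS",
    "SM_HOSTS",
    "SM_CURRENT_HOST",
    "CUDA_VISIBLE_DEVICES",
    "WORLD_SIZE",
    "RANK",
    "LOCAL_RANK",
)


def collect_runtime_env(environ=None):
    """Return the subset of KEYS that is present in the environment.

    Hypothetical helper for debugging; pass a dict to inspect something
    other than the current process environment.
    """
    env = os.environ if environ is None else environ
    return {key: env[key] for key in KEYS if key in env}


if __name__ == "__main__":
    # Print whatever is set so it shows up in the job logs.
    print(json.dumps(collect_runtime_env(), indent=2, sort_keys=True))
```

Calling this at the top of the training script makes it easier to compare what the container reports with what accelerate config was told.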
To reproduce the bug:
Run accelerate config, use 209479262201.dkr.ecr.us-west-2.amazonaws.com/1xgpt-from-sagemaker:2.3.0 as your docker image
Run accelerate launch and run the training script
Expected behavior
Sagemaker will return an error somewhere along the lines of this: