Using accelerate launch to initialize a SageMaker job doesn't work properly with multiple GPUs #3148

BaldPulse opened this issue Oct 9, 2024 · 3 comments

BaldPulse commented Oct 9, 2024

System Info

- `Accelerate` version: 0.30.1
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 30.98 GB
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Note: the system info above does not reflect the actual environment accelerate runs in on SageMaker; the config above was generated in an official SageMaker container.

To reproduce the bug:

  1. Create any training script that invokes accelerator.gather() (a minimal sketch of such a script is shown after these steps)
  2. Configure accelerate to run on a SageMaker multi-GPU machine using accelerate config, with 209479262201.dkr.ecr.us-west-2.amazonaws.com/1xgpt-from-sagemaker:2.3.0 as the Docker image
  3. Create a training job using accelerate launch and run the training script
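
For illustration, a minimal sketch of such a training script (not the reporter's actual script; the tensor contents are arbitrary placeholders):

```python
# Minimal sketch of a script that exercises the failing gather() path.
# Any script that calls accelerator.gather() on a CUDA tensor across processes
# should hit the same code path; the values here are arbitrary.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Each process contributes one tensor; gather() concatenates them across ranks,
# which on the multi-GPU path goes through torch.distributed's all-gather op.
local_metric = torch.tensor([float(accelerator.process_index)], device=accelerator.device)
gathered = accelerator.gather(local_metric)

if accelerator.is_main_process:
    print("gathered:", gathered)
```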

Expected behavior

SageMaker will return an error along the lines of this:

File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2373, in gather_for_metrics
 data = self.gather(input_data)
 ^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2329, in gather
 return gather(tensor)
 ^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 380, in wrapper
 return function(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 441, in gather
 return _gpu_gather(tensor)
 ^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 360, in _gpu_gather
 return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
 return func(data, *args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 350, in _gpu_gather_one
 gather_op(output_tensors, tensor)
 File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
 return func(*args, **kwargs)
 ^^^^^^^^^^^^^^^^^^^^^
 File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
 work = group._allgather_base(output_tensor, input_tensor, opts)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 RuntimeError: SMDDP does not support: _allgather_base
@BaldPulse (Author)

If accelerate launch is invoked inside of SageMaker instead of being used to create the SageMaker job, the script works fine. I suspect this is because MPI is not well supported by SageMaker, yet accelerate launch uses MPI when it creates the job.
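
For concreteness, a rough sketch of what "invoking accelerate launch inside of SageMaker" could look like: the SageMaker job runs a plain Python entry point, and that entry point shells out to accelerate launch within the container. The file names and process count below are assumptions, not values from this issue.

```python
# entry.py -- hypothetical SageMaker entry point illustrating the workaround:
# the job starts this script, and `accelerate launch` runs *inside* the
# container instead of being used to create the SageMaker job from the outside.
import subprocess
import sys

cmd = [
    "accelerate", "launch",
    "--multi_gpu",           # plain torch.distributed (NCCL) launch, no MPI/SMDDP
    "--num_machines", "1",
    "--num_processes", "8",  # assumed: one process per GPU on a single 8-GPU node
    "train.py",              # the training script that calls accelerator.gather()
]
sys.exit(subprocess.call(cmd))
```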

@muellerzr (Collaborator)

Yes, I'd recommend invoking it inside of SageMaker instead in this case. (Though MPI should only be run on CPU, not GPU.)

@BaldPulse (Author) commented Oct 15, 2024

> Yes, I'd recommend invoking it inside of SageMaker instead in this case. (Though MPI should only be run on CPU, not GPU.)

Sorry if I wasn't clear in my original report. This is more of a complaint about the default behavior of accelerate launch when configured to run on SageMaker. When I followed this guide to configure and run accelerate with SageMaker, it defaulted to MPI, which doesn't work with distributed training on SageMaker. accelerate launch should default to NCCL when configured to run distributed training on SageMaker.
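
As a side note (not from the original thread), one way to confirm which backend a given launch configuration actually ends up with is to print it from inside the training script after the Accelerator is created:

```python
# Small check to confirm which distributed backend got initialized inside the job.
import torch.distributed as dist
from accelerate import Accelerator

accelerator = Accelerator()

if dist.is_available() and dist.is_initialized():
    # For NCCL-based multi-GPU training this should print "nccl";
    # an MPI- or SMDDP-backed setup will report a different backend.
    print("backend:", dist.get_backend(), "world size:", dist.get_world_size())
```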
