MPI on CPU-only: "no support for _allgather_base" #3176

tikhu opened this issue Oct 17, 2024 · 0 comments

System Info

- `Accelerate` version: 1.0.1
- Platform: Linux-6.10.4-linuxkit-aarch64-with-glibc2.35
- `accelerate` bash location: /usr/local/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.5.0a0+b465a5843b.nv24.09 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 70.54 GB
- `Accelerate` default config:
        Not found

This is my config file (`config_mpi.yaml`):

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_CPU
downcast_bf16: 'no'
enable_cpu_affinity: false
ipex_config:
  ipex: false
machine_rank: 0
main_process_ip: 172.18.0.2
main_process_port: 8888
main_training_function: main
mixed_precision: 'no'
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /accelerate/hostfile
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

For testing, this runs in two MPI-connected Docker containers (based on nvcr.io/nvidia/pytorch:24.09-py3) on an Apple M3 Max running macOS 15.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. On an MPI system with more than one node, run `accelerate launch --config_file=config_mpi.yaml nlp_example.py --cpu`
  2. Wait for the run to crash.
  3. Get this output (a minimal standalone reproduction is sketched after the traceback):
[rank0]: Traceback (most recent call last):
[rank0]:   File "/accelerate/nlp_example.py", line 209, in <module>
[rank0]:     main()
[rank0]:   File "/accelerate/nlp_example.py", line 205, in main
[rank0]:     training_function(config, args)
[rank0]:   File "/accelerate/nlp_example.py", line 179, in training_function
[rank0]:     predictions, references = accelerator.gather_for_metrics((predictions, batch["labels"]))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2500, in gather_for_metrics
[rank0]:     data = self.gather(input_data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2456, in gather
[rank0]:     return gather(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 398, in wrapper
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 437, in gather
[rank0]:     return _gpu_gather(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank0]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 108, in recursively_apply
[rank0]:     return honor_type(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 82, in honor_type
[rank0]:     return type(obj)(generator)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 111, in <genexpr>
[rank0]:     recursively_apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 127, in recursively_apply
[rank0]:     return func(data, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 346, in _gpu_gather_one
[rank0]:     gather_op(output_tensors, tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3410, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: RuntimeError: no support for _allgather_base in MPI process group
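
For context, the crash can presumably be reproduced without Accelerate at all, since it comes from `torch.distributed.all_gather_into_tensor`, which torch's MPI process group does not implement. A minimal standalone sketch, assuming PyTorch is built with MPI support and the script (the name `repro.py` is just an example) is launched with something like `mpirun -n 2 python repro.py`:

```python
# Minimal sketch (not part of the original report): all_gather_into_tensor is
# backed by _allgather_base, which the MPI process group does not provide, so
# the same RuntimeError should appear here.
import torch
import torch.distributed as dist

dist.init_process_group(backend='mpi')  # rank/world size come from mpirun
world_size = dist.get_world_size()

x = torch.full((4,), float(dist.get_rank()))
out = torch.empty(4 * world_size)

# This mirrors accelerate's _gpu_gather_one, which calls gather_op(output_tensors, tensor):
dist.all_gather_into_tensor(out, x)  # RuntimeError: no support for _allgather_base ...

# The list-based collective is implemented by the MPI backend and should work:
# outs = [torch.empty(4) for _ in range(world_size)]
# dist.all_gather(outs, x)
```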

If I swap MPI out from under Accelerate, the example runs without any error message.
I do this by adding the following:

import os
import torch.distributed as dist

# Initialize the process group with the gloo backend ourselves, using the rank
# and world size that Accelerate exports as environment variables.
world_size = int(os.environ.get('ACCELERATE_WORLD_SIZE', '1'))
rank = int(os.environ.get('ACCELERATE_RANK', '0'))
dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)
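
Presumably this helps because the gloo backend implements `_allgather_base` (the primitive behind `torch.distributed.all_gather_into_tensor`), while torch's MPI process group does not. A sketch of where the workaround would sit in `nlp_example.py`, assuming `Accelerator()` reuses an already-initialized process group rather than creating an MPI one, and that the launcher has set `MASTER_ADDR`/`MASTER_PORT` for the default `env://` rendezvous:

```python
# Hypothetical placement sketch, not the exact code from the report.
import os
import torch.distributed as dist
from accelerate import Accelerator

# Initialize gloo before constructing Accelerator so that Accelerate
# (presumably) picks up this process group instead of creating an MPI one.
world_size = int(os.environ.get('ACCELERATE_WORLD_SIZE', '1'))
rank = int(os.environ.get('ACCELERATE_RANK', '0'))
dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

accelerator = Accelerator(cpu=True)  # gather_for_metrics now goes through gloo
```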

Expected behavior

I expect the example not to crash.
It runs fine when `num_machines` is 1.
