
Duplicated process and memory leakage for evaluation process in all_gather #3147

Open
2 of 4 tasks
SangbumChoi opened this issue Oct 9, 2024 · 2 comments · May be fixed by #3164

@SangbumChoi

System Info

Applies to all Accelerate versions.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Related issues:
huggingface/transformers#15466
https://github.com/huggingface/transformers/pull/28769/files

Expected behavior

torch.distributed.all_gather(output_tensors, tensor)

All of Accelerate's gather utilities are built on `all_gather`. However, it is also possible to gather only onto the main process to compute evaluation metrics. If we use `all_gather` for evaluation and then move the result to CPU, the memory cost is multiplied by n (where n is the number of processes), even though we only need the distributed tensors collected in one place to run the computation.

What do you think about this?

https://github.com/facebookresearch/detectron2/blob/ebe8b45437f86395352ab13402ba45b75b4d1ddb/detectron2/utils/comm.py#L188
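A minimal sketch of the gather-to-one-process idea, similar in spirit to the detectron2 helper linked above (illustrative only, not Accelerate's API; assumes the process group backend supports `gather`, e.g. gloo, or NCCL on newer PyTorch releases):

```python
import torch
import torch.distributed as dist

def gather_to_main(tensor: torch.Tensor, dst: int = 0):
    """Collect `tensor` from every rank onto rank `dst` only.

    Unlike all_gather, non-destination ranks never hold the full set of
    shards, so moving results to CPU does not multiply memory usage by
    the world size on every process.
    """
    world_size = dist.get_world_size()
    if dist.get_rank() == dst:
        # Only the destination rank allocates buffers for all shards.
        shards = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.gather(tensor, gather_list=shards, dst=dst)
        return torch.cat(shards, dim=0).cpu()
    # Other ranks participate in the collective but receive nothing.
    dist.gather(tensor, gather_list=None, dst=dst)
    return None
```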

@SangbumChoi (Author)

#2898

@muellerzr (Collaborator) commented Oct 10, 2024

@SangbumChoi definitely open to trying out something more efficient! Best case scenario, we have a flag to use `all_gather` instead and default to this new method as part of the function. Would you like to take a stab at a PR?
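One hypothetical shape for that flag (purely illustrative, not Accelerate's actual API; reuses the `gather_to_main` sketch above):

```python
def gather_for_evaluation(tensor, use_all_gather: bool = False, dst: int = 0):
    # Flag preserves today's behaviour: every rank ends up with the full result.
    if use_all_gather:
        world_size = dist.get_world_size()
        shards = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.all_gather(shards, tensor)
        return torch.cat(shards, dim=0)
    # Default: collect shards only on the destination rank (memory-friendly).
    return gather_to_main(tensor, dst=dst)
```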

@SangbumChoi SangbumChoi linked a pull request Oct 14, 2024 that will close this issue