
Duplicated process and memory leakage for evaluation process in all_gather #3147

Open
2 of 4 tasks
SangbumChoi opened this issue Oct 9, 2024 · 2 comments · May be fixed by #3164

@SangbumChoi

System Info

Applies to all Accelerate versions.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Related issues:
huggingface/transformers#15466
https://github.com/huggingface/transformers/pull/28769/files

Expected behavior

torch.distributed.all_gather(output_tensors, tensor)

All of Accelerate's gather utilities are built on `all_gather`. However, it is also possible to gather only onto the main process to compute evaluation metrics. If we use `all_gather` for evaluation and then move the result to CPU, the memory cost is multiplied by n (where n is the number of processes), even though we only need the distributed tensors collected in one place to run the computation.

What do you think about this?

https://github.com/facebookresearch/detectron2/blob/ebe8b45437f86395352ab13402ba45b75b4d1ddb/detectron2/utils/comm.py#L188
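A minimal sketch of the gather-to-one-process idea, similar in spirit to the detectron2 helper linked above (illustrative only, not Accelerate's API; assumes the process group backend supports `gather`, e.g. gloo, or NCCL on newer PyTorch releases):

```python
import torch
import torch.distributed as dist

def gather_to_main(tensor: torch.Tensor, dst: int = 0):
    """Collect `tensor` from every rank onto rank `dst` only.

    Unlike all_gather, non-destination ranks never hold the full set of
    shards, so moving results to CPU does not multiply memory usage by
    the world size on every process.
    """
    world_size = dist.get_world_size()
    if dist.get_rank() == dst:
        # Only the destination rank allocates buffers for all shards.
        shards = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.gather(tensor, gather_list=shards, dst=dst)
        return torch.cat(shards, dim=0).cpu()
    # Other ranks participate in the collective but receive nothing.
    dist.gather(tensor, gather_list=None, dst=dst)
    return None
```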

@SangbumChoi (Author)

#2898

@muellerzr (Collaborator) commented Oct 10, 2024

@SangbumChoi definitely open to trying out something more efficient! Best case scenario, we have a flag to use `all_gather` instead and default to this new method as part of the function. Would you like to take a stab at a PR?
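One hypothetical shape for that flag (purely illustrative, not Accelerate's actual API; reuses the `gather_to_main` sketch above):

```python
def gather_for_evaluation(tensor, use_all_gather: bool = False, dst: int = 0):
    # Flag preserves today's behaviour: every rank ends up with the full result.
    if use_all_gather:
        world_size = dist.get_world_size()
        shards = [torch.empty_like(tensor) for _ in range(world_size)]
        dist.all_gather(shards, tensor)
        return torch.cat(shards, dim=0)
    # Default: collect shards only on the destination rank (memory-friendly).
    return gather_to_main(tensor, dst=dst)
```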

@SangbumChoi SangbumChoi linked a pull request Oct 14, 2024 that will close this issue