how to use all_gather in training loop? #2504
Comments
@kkarrancsu Thanks for your answer. In general, I can provide an example ASAP and maybe update the docs accordingly. However, I'm not completely sure about your question. In fact, if you want to compute predictions in DDP, gather them, and backpropagate from a single process, it won't work. You can check the internal design: https://pytorch.org/docs/stable/notes/ddp.html#internal-design
@sdesrozis Thanks for your quick reply! Sorry if my initial question was unclear. As an example:
I'd like to gather all the
Thanks for the clarification. Would you like to use the loss as a metric? Or would you want to call
I'd like to call
OK, so I think it won't work even if you gather the predictions. The gather operation is not an autograd function, so it cuts the computation graph. The forward pass also creates some internal states that won't be gathered. I'm pretty sure this has been answered on the PyTorch forum. Maybe I'm wrong, though, and I would be interested in a discussion on this topic. EDIT: see here https://amsword.medium.com/gradient-backpropagation-with-torch-distributed-all-gather-9f3941a381f8
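One common workaround for the graph being cut can be sketched as follows. This is a minimal sketch, not ignite's API: it uses `torch.distributed.all_gather` directly, and the helper names `gather_keep_local_grad` and `train_step` are hypothetical. The idea is to gather detached copies from every rank and then put the local tensor back into its own slot, so the loss sees the full batch while gradients flow only through the local slice. It assumes every rank passes a tensor of the same shape.

```python
import torch
import torch.distributed as dist


def gather_keep_local_grad(local: torch.Tensor) -> torch.Tensor:
    """Concatenate `local` from all ranks along dim 0 (hypothetical helper).

    Slices received from other ranks are constants; only this rank's
    slice stays attached to the autograd graph.
    """
    if not (dist.is_available() and dist.is_initialized()):
        # Single-process fallback: there is nothing to gather.
        return local
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    # all_gather itself is not differentiable, so gather detached copies.
    dist.all_gather(gathered, local.detach().contiguous())
    # Re-insert the local tensor so gradients can reach the local model.
    gathered[rank] = local
    return torch.cat(gathered, dim=0)


def train_step(model, optimizer, loss_fn, x, y):
    """One optimization step with the loss computed on the gathered batch."""
    model.train()
    optimizer.zero_grad()
    outputs = gather_keep_local_grad(model(x))
    targets = gather_keep_local_grad(y)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because `DistributedDataParallel` averages gradients across ranks while each rank here contributes only the part of the full-batch gradient that passes through its own samples, the loss may need to be multiplied by the world size to match a single-process run on the whole batch. Treat that scaling as something to verify for your loss. An ignite-style `train_step(engine, batch)` could wrap the same body.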
@sdesrozis Thanks - I will investigate based on your link and report back.
Good! Although I'm doubtful about the link… I'd be interested in your feedback.
@kkarrancsu can you provide a few more details on what exactly you would like to do? As for distributed autograd, you can also check: https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework
Hi @vfdev-5, sure. We are using the Supervised Contrastive loss to train an embedding. In Eq. 2 of the paper, we see that the loss depends on the number of samples used to compute it (positive and negative). My colleague suggested that it is better to compute the loss over all examples (the entire batch), rather than considering
OK, I understand. You should have a look at a distributed implementation of SimCLR. See for instance This might give you some inspiration.
This code is not quite correct. Please check this issue: Spijkervet/SimCLR#30 and my PR: Spijkervet/SimCLR#46.
I have defined my train_step in exactly the same way as in the cifar10 example. Is it possible to gather all of the predictions before computing the loss? I haven't seen examples of this pattern in the ignite examples (maybe I'm missing it?), but for my application, it is better to compute the loss after aggregating the forward passes and targets run on multiple GPUs. This only matters when using `DistributedDataParallel`, since `DataParallel` automatically aggregates the outputs.

I see the `idist.all_gather()` function, but am unclear how to use it in a training loop.