Add DDP token averaging for equivalent non-parallel training similar to #34191 #34242

sbwww · 2024-10-18T10:03:25Z

Feature request

Token averaging under gradient accumulation was fixed in #34191, but token averaging under DDP seems to have the same issue.

Expected behavior

With all the tokens contributing to loss in each step (in each GPU, gradient accumulation step, and microbatch), the equation becomes:

$$ntokens=\sum\limits_{GPUs} \sum\limits_{gas} \sum\limits_{microb} (label\neq-100)$$

I believe we should average over all of these tokens at once, so that training is equivalent to non-parallel training.
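
As a rough illustration of the count in the formula above, here is a minimal sketch in plain Python. The nested-list layout (GPU, then accumulation step, then microbatch row) and the helper name `count_loss_tokens` are assumptions made for this example, not part of the Trainer.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def count_loss_tokens(labels_per_gpu):
    """Count labels != -100 over every GPU, accumulation step, and microbatch row.

    Hypothetical layout: labels_per_gpu[gpu][gas_step][row] is a list of label ids.
    """
    return sum(
        label != IGNORE_INDEX
        for gpu in labels_per_gpu
        for step in gpu
        for row in step
        for label in row
    )


# Made-up labels: 2 GPUs, 2 accumulation steps each, 1 row per microbatch.
labels = [
    [[[5, -100, 7]], [[-100, -100, 2]]],  # GPU 0: 2 + 1 loss tokens
    [[[1, 2, 3]], [[4, -100, -100]]],     # GPU 1: 3 + 1 loss tokens
]
ntokens = count_loss_tokens(labels)
print(ntokens)
```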

Current issue

Prior to #34191, the loss/gradients were averaged over $\sum\limits_{GPUs}$, $\sum\limits_{gas}$, and $\sum\limits_{microb}$ separately. The num_items_in_batch introduced in #34191 refers to:

$$ntokens=\sum\limits_{gas} \sum\limits_{microb} (label\neq-100)$$

So the loss/gradients are now averaged over $\sum\limits_{GPUs}$ and $\left(\sum\limits_{gas}\sum\limits_{microb}\right)$ separately. However, this still does not seem equivalent to non-parallel training.
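
To make the mismatch concrete, here is a small numeric sketch in plain Python. The per-token loss values are made up for illustration. It compares the mean over all tokens (what a single-device run would compute) with the mean of per-GPU means (what results when each GPU normalizes by its own token count and the results are then averaged across GPUs).

```python
def mean(xs):
    return sum(xs) / len(xs)


# Made-up per-token losses of the non-ignored tokens on each GPU.
token_losses_per_gpu = [
    [1.0, 1.0, 1.0],  # GPU 0: three loss tokens
    [4.0],            # GPU 1: one loss token
]

# Non-parallel reference: one average over every token.
all_losses = [loss for gpu in token_losses_per_gpu for loss in gpu]
global_mean = mean(all_losses)

# Separate averaging: per-GPU token mean first, then a mean over GPUs.
mean_of_means = mean([mean(gpu) for gpu in token_losses_per_gpu])

print(global_mean, mean_of_means)
```

The two values only agree when every GPU happens to hold the same number of loss tokens.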

Can we also incorporate $\sum\limits_{GPUs}$ when determining num_items_in_batch? Something like all_reduce(num_items_in_batch)?

Motivation

DDP does not seem to be fully equivalent to non-parallel training.

related comments: #34191 (comment)

Your contribution

I found a fairseq implementation of this feature:

https://github.com/facebookresearch/fairseq/blob/018621f3cca02ca9de945dc082c3fb1a7f9f2deb/fairseq/trainer.py#L932-L949


muellerzr · 2024-10-18T16:56:49Z

I observed this as well when I was running some experiments (things were close post-fix, but not exact). Would you like to take a stab at a PR? :)

techkang · 2024-10-22T04:12:19Z

A simple implementation may be:

add all_reduce(num_items_in_batch, op=SUM) after: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2416
add loss *= get_world_size() after: https://github.com/huggingface/transformers/blob/main/src/transformers/loss/loss_utils.py#L26
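
The arithmetic behind these two steps can be checked with a single-process simulation. This is only a sketch under stated assumptions: `simulated_all_reduce_sum` is a hypothetical stand-in for a real `torch.distributed.all_reduce` with `ReduceOp.SUM`, the loss values are made up, and the final division by `world_size` stands in for the averaging across ranks that DDP applies to gradients (the same reasoning carries over because gradients are linear in the loss).

```python
def simulated_all_reduce_sum(values):
    """Hypothetical stand-in for all_reduce(op=SUM): every rank gets the total."""
    total = sum(values)
    return [total for _ in values]


# Made-up per-token losses of the non-ignored tokens on each rank.
token_losses_per_rank = [[1.0, 1.0, 1.0], [4.0]]
world_size = len(token_losses_per_rank)

# Step 1: sum the local token counts across ranks.
local_counts = [len(losses) for losses in token_losses_per_rank]
global_counts = simulated_all_reduce_sum(local_counts)

# Step 2: each rank divides its summed loss by the global count, then
# multiplies by world_size to cancel the averaging across ranks.
rank_losses = [
    sum(losses) / n_global * world_size
    for losses, n_global in zip(token_losses_per_rank, global_counts)
]

# Averaging across ranks, as DDP does for gradients.
ddp_result = sum(rank_losses) / world_size

# Non-parallel reference: one mean over all tokens.
reference = sum(sum(losses) for losses in token_losses_per_rank) / sum(local_counts)

print(ddp_result, reference)
```

In this toy case the rescaled result matches the single-device mean, which is the equivalence the issue asks for. The cost is one extra collective per optimizer step.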

TechxGenus · 2024-10-22T12:55:13Z

Although this issue has little impact on the training results, it significantly affects the ability to reproduce experiments across different hardware configurations. I hope it can be resolved alongside gradient accumulation.

I attempted to use all-reduce during training, but it slowed down the process. Is it possible to calculate the total number of tokens per batch across devices when initializing the Dataloader with accelerate (without compromising compatibility with the existing code)?

muellerzr · 2024-10-22T13:32:29Z

That is the issue with it, and why I'm not the biggest fan of that particular solution.

We can't, because there are situations, such as IterableDatasets, where that just isn't possible.

The fairseq solution may be the way

muellerzr · 2024-10-22T14:20:59Z

Can confirm the fairseq solution works great, it'll be part of #34283

muellerzr · 2024-10-22T16:12:20Z

This however does not make any impact as we scale (current fix or these ones)

This might be problem specific, however I did find the fix helped a little

sbwww added the "Feature request" label on Oct 18, 2024

LysandreJik added the "Discussion" label on Oct 21, 2024

muellerzr linked a pull request Oct 22, 2024 that will close this issue

Enable Gradient Accumulation fix across all models + trainer fully in forward() #34283

