[PyTorch] Add context parallel support for packed dataset in THD format #9540
Conversation
Your code indentation is not consistent: some places have 4 spaces and others have 2 spaces.
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py
Signed-off-by: tomlifu <[email protected]>
Thanks for fixing the comments; it looks much better now.
ceil_to_nearest = lambda n, m: (n + m - 1) // m * m
for data in dataset:
    max_length = min(max_seq_length, ceil_to_nearest(len(data['input_ids']), pad_seq_length_to_mult))
    pre_pad_dataset(data, max_length, pad_id)
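For reference, a quick standalone sketch of what the rounding above does (the value of `pad_seq_length_to_mult` and the sample lengths here are made up for illustration):

```python
# Round a length up to the nearest multiple of m, e.g. to satisfy a
# divisibility requirement such as (2 * context_parallel_size).
ceil_to_nearest = lambda n, m: (n + m - 1) // m * m

pad_seq_length_to_mult = 8  # hypothetical value for the example
for seq_len in (5, 8, 13):
    print(seq_len, "->", ceil_to_nearest(seq_len, pad_seq_length_to_mult))
# 5 -> 8, 8 -> 8, 13 -> 16
```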
How is the loss_mask handled for padded tokens?
The loss_mask is handled in the packing function here: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py#L142
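In outline, the idea is that padded positions get a zero loss mask so they do not contribute to the loss. A minimal sketch of that pattern (not the actual code in `prepare_packed_ft_dataset.py`; `build_loss_mask` and `answer_start_idx` are hypothetical names for illustration):

```python
def build_loss_mask(input_ids, pad_id, answer_start_idx=0):
    """Zero out padded (and pre-answer) positions so they don't contribute to the loss.
    Sketch only; the real packing script derives answer_start_idx from the prompt."""
    return [
        1.0 if (tok != pad_id and idx >= answer_start_idx) else 0.0
        for idx, tok in enumerate(input_ids)
    ]

# e.g. a sequence of length 5 padded to 8 with pad_id=0
print(build_loss_mask([11, 12, 13, 14, 15, 0, 0, 0], pad_id=0))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```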
cu_seqlens = cu_seqlens // cp_size
forward_args = {
    'input_ids': batch['tokens'],
    'position_ids': batch['position_ids'],
Does position_ids mean the token_id in the packed sequence? How is this argument used in the training fwd and bwd passes?
The position_ids are the positions of the tokens within a sequence (e.g. [0, 1, 2, ..., seq_len-1]). In a packed sequence, we have one such list per individual sequence, since the packed sequence is composed of many individual sequences. I'm not sure that's what you mean by token_id. It's used the same way as input_ids in the training fwd and bwd passes.
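For example (illustrative values only), packing two sequences of lengths 3 and 4 into one THD-format sample gives position_ids that restart at 0 for each sub-sequence:

```python
# Two sequences of lengths 3 and 4 packed into one THD-format sample.
seq_lens = [3, 4]
position_ids = [p for length in seq_lens for p in range(length)]
cu_seqlens = [0, 3, 7]  # cumulative sequence boundaries passed to attention

print(position_ids)  # [0, 1, 2, 0, 1, 2, 3]
```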
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Hello, I have a question about CP requiring the sequence length to be divisible by world_size * 2. I see that you are padding the data in the code, but this means pad token ids enter the flash attention computation. I'm not sure whether this is correct.
What does this PR do?
This PR adds context parallel support for packed datasets in THD format in NeMo, in response to this TE PR: NVIDIA/TransformerEngine#641. Currently, the TE PR requires that each individual sequence length be divisible by (2*context_parallel_size).
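To illustrate the divisibility requirement, here is a minimal, hedged sketch of padding a single sequence up to a multiple of 2*context_parallel_size (the helper name and values are hypothetical, not the code in this PR):

```python
def pad_to_cp_multiple(seq, pad_id, cp_size):
    """Pad one sequence so its length is divisible by 2 * cp_size, as required
    by the TE THD context-parallel path. Hypothetical helper for illustration."""
    multiple = 2 * cp_size
    target = (len(seq) + multiple - 1) // multiple * multiple
    return seq + [pad_id] * (target - len(seq))

# e.g. cp_size=2 -> every sequence length must be a multiple of 4
print(pad_to_cp_multiple([7, 8, 9, 10, 11], pad_id=0, cp_size=2))
# [7, 8, 9, 10, 11, 0, 0, 0]
```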
Changes
PR Type: