[PyTorch] Add context parallel support for packed dataset in THD format #9540
Conversation
Your code indentation is not consistent: some places have 4 spaces and others have 2 spaces.
nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py
Signed-off-by: tomlifu <[email protected]>
Thanks for fixing the comments; it looks much better now.
ceil_to_nearest = lambda n, m: (n + m - 1) // m * m
for data in dataset:
    max_length = min(max_seq_length, ceil_to_nearest(len(data['input_ids']), pad_seq_length_to_mult))
    pre_pad_dataset(data, max_length, pad_id)
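For reference, a quick standalone sketch of what the rounding above does (the value of `pad_seq_length_to_mult` and the sample lengths here are made up for illustration):

```python
# Round a length up to the nearest multiple of m, e.g. to satisfy a
# divisibility requirement such as (2 * context_parallel_size).
ceil_to_nearest = lambda n, m: (n + m - 1) // m * m

pad_seq_length_to_mult = 8  # hypothetical value for the example
for seq_len in (5, 8, 13):
    print(seq_len, "->", ceil_to_nearest(seq_len, pad_seq_length_to_mult))
# 5 -> 8, 8 -> 8, 13 -> 16
```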
How is the loss_mask handled for padded tokens?
The loss_mask is handled in the packing function here: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/prepare_packed_ft_dataset.py#L142
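In outline, the idea is that padded positions get a zero loss mask so they do not contribute to the loss. A minimal sketch of that pattern (not the actual code in `prepare_packed_ft_dataset.py`; `build_loss_mask` and `answer_start_idx` are hypothetical names for illustration):

```python
def build_loss_mask(input_ids, pad_id, answer_start_idx=0):
    """Zero out padded (and pre-answer) positions so they don't contribute to the loss.
    Sketch only; the real packing script derives answer_start_idx from the prompt."""
    return [
        1.0 if (tok != pad_id and idx >= answer_start_idx) else 0.0
        for idx, tok in enumerate(input_ids)
    ]

# e.g. a sequence of length 5 padded to 8 with pad_id=0
print(build_loss_mask([11, 12, 13, 14, 15, 0, 0, 0], pad_id=0))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```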
cu_seqlens = cu_seqlens // cp_size
forward_args = {
    'input_ids': batch['tokens'],
    'position_ids': batch['position_ids'],
Does position_ids mean the token_id in the packed sequence? How is this argument used in the training fwd and bwd passes?
The position_ids are the positions of the tokens within a sequence (e.g. [0, 1, 2, ..., seq_len-1]). In a packed sequence, we have one such list per individual sequence, since the packed sequence is composed of many individual sequences. I'm not sure that's what you mean by token_id. It's used the same way as input_ids in the training fwd and bwd passes.
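For example (illustrative values only), packing two sequences of lengths 3 and 4 into one THD-format sample gives position_ids that restart at 0 for each sub-sequence:

```python
# Two sequences of lengths 3 and 4 packed into one THD-format sample.
seq_lens = [3, 4]
position_ids = [p for length in seq_lens for p in range(length)]
cu_seqlens = [0, 3, 7]  # cumulative sequence boundaries passed to attention

print(position_ids)  # [0, 1, 2, 0, 1, 2, 3]
```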
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Hello, I have a question about CP requiring the sequence length to be divisible by world_size * 2. I see that you are padding the data in the code, but this means pad token ids enter the flash attention computation. I'm not sure whether this is correct.
What does this PR do?
This PR adds context parallel support for packed datasets in THD format in NeMo, in response to this TE PR: NVIDIA/TransformerEngine#641. Currently, the TE PR requires that each individual sequence length be divisible by (2*context_parallel_size).
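To illustrate the divisibility requirement, here is a minimal, hedged sketch of padding a single sequence up to a multiple of 2*context_parallel_size (the helper name and values are hypothetical, not the code in this PR):

```python
def pad_to_cp_multiple(seq, pad_id, cp_size):
    """Pad one sequence so its length is divisible by 2 * cp_size, as required
    by the TE THD context-parallel path. Hypothetical helper for illustration."""
    multiple = 2 * cp_size
    target = (len(seq) + multiple - 1) // multiple * multiple
    return seq + [pad_id] * (target - len(seq))

# e.g. cp_size=2 -> every sequence length must be a multiple of 4
print(pad_to_cp_multiple([7, 8, 9, 10, 11], pad_id=0, cp_size=2))
# [7, 8, 9, 10, 11, 0, 0, 0]
```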
Changes
PR Type: