
Disable FP8 in Mcore integration test on older GPUs #1357

Merged
2 commits merged into NVIDIA:main on Dec 6, 2024

Conversation

timmoon10 (Collaborator)

Description

We have experienced test failures in the Mcore integration tests on older GPUs, so this PR checks the GPU arch before enabling FP8. It also generates synthetic data so the test can run without access to a specific data path.
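For context, Transformer Engine's FP8 execution requires Ada-class GPUs (compute capability 8.9) or newer, so Ampere and older fall back to the non-FP8 path. A minimal sketch of how the arch check might look, assuming the compute capability is read via PyTorch (the PR's script may obtain DEVICE_ARCH differently):

# Illustrative only: derive the compute capability as an integer, e.g. 80 (A100), 89 (L40), 90 (H100)
DEVICE_ARCH=$(python -c "import torch; cc = torch.cuda.get_device_capability(); print(cc[0] * 10 + cc[1])")
WITH_FP8=0
if [[ ${DEVICE_ARCH} -ge 89 ]]; then
    WITH_FP8=1  # enable FP8 only on Ada/Hopper and newer
fi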

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • Disable FP8 in Mcore integration test on older GPUs
  • Remove data dependency in Mcore integration test

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Avoid FP8 on Ampere and older. Generate synthetic data instead of depending on external data.

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 added the bug, testing, and 1.14.0 labels on Dec 5, 2024
@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

@sudhakarsingh27 (Collaborator) left a comment

Only minor nitpicks/questions. Otherwise, LGTM.

if [[ ${DEVICE_ARCH} -ge 89 ]]; then
WITH_FP8=1
fi

# Download Megatron-LM if needed
if [ ! -d "${MCORE_PATH}" ]; then
pushd $(dirname ${MCORE_PATH})
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
sudhakarsingh27 (Collaborator)

I guess then we have to manually update this branch. Is this okay?

timmoon10 (Collaborator, Author)

I think there's an argument for just downloading the Megatron-LM main branch, but that is orthogonal to this PR.

printf "{" >> ${VOCAB_FILE}
printf "\"<|endoftext|>\": 0" >> ${VOCAB_FILE}
seq 1 4095 | awk '{ printf(", \"%d\": %d", $1, $1) }' >> ${VOCAB_FILE}
printf "}" >> ${VOCAB_FILE}
sudhakarsingh27 (Collaborator)

Wow, that's enough to generate a vocab file!

timmoon10 (Collaborator, Author) on Dec 6, 2024

This is somewhat specific to the Mcore mock GPT dataset: https://github.com/NVIDIA/Megatron-LM/blob/bd677bfb13ac2f19deaa927adc6da6f9201d66aa/megatron/core/datasets/gpt_dataset.py#L693
The model figures out the number of embeddings based on the vocab size. However, the mock dataset just generates indices with arange (up to a max sequence length of 4096), so the vocab can be junk.
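For reference, the printf/awk snippet above writes a single-line JSON object with 4096 entries ({"<|endoftext|>": 0, "1": 1, ..., "4095": 4095}), so the embedding table sized from the vocab covers every index the mock dataset's arange can emit. A quick sanity check, shown purely as an illustration and not part of the PR:

# Hypothetical check: count the entries in the generated vocab file
python -c "import json, sys; print(len(json.load(open(sys.argv[1]))))" ${VOCAB_FILE}
# expected output: 4096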

- --vocab-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-vocab.json
- --merge-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-merges.txt
+ --vocab-file ${VOCAB_FILE}
+ --merge-file ${TE_PATH}/qa/L1_pytorch_mcore_integration/merges.txt
sudhakarsingh27 (Collaborator)

Not sure what the purpose of this file is, but is it okay for it to only have version info?

timmoon10 (Collaborator, Author)

I'm treating merges.txt as a magic config for the tokenizer: huggingface/transformers#1083 (comment)
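For background (an assumption about the file's contents, not taken from the PR diff): a GPT-2-style BPE merges.txt starts with a version header and then lists one merge rule per line. A file containing only the header, for example:

#version: 0.2

describes a BPE tokenizer with no merges, which is harmless here because the mock dataset never tokenizes real text; only the vocab size matters.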

@timmoon10 merged commit d8b13cb into NVIDIA:main on Dec 6, 2024
29 checks passed
@timmoon10 deleted the debug-mcore-test branch on December 6, 2024, 05:42