
Disable FP8 in Mcore integration test on older GPUs #1357

Merged
2 commits merged into NVIDIA:main on Dec 6, 2024

Conversation

timmoon10 (Collaborator)

Description

We have experienced test failures in the Mcore integration tests on older GPUs, so this PR checks the GPU arch before enabling FP8. It also generates synthetic data so the test can run without access to a specific data path.
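For context, Transformer Engine's FP8 execution requires Ada-class GPUs (compute capability 8.9) or newer, so Ampere and older fall back to the non-FP8 path. A minimal sketch of how the arch check might look, assuming the compute capability is read via PyTorch (the PR's script may obtain DEVICE_ARCH differently):

# Illustrative only: derive the compute capability as an integer, e.g. 80 (A100), 89 (L40), 90 (H100)
DEVICE_ARCH=$(python -c "import torch; cc = torch.cuda.get_device_capability(); print(cc[0] * 10 + cc[1])")
WITH_FP8=0
if [[ ${DEVICE_ARCH} -ge 89 ]]; then
    WITH_FP8=1  # enable FP8 only on Ada/Hopper and newer
fi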

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • Disable FP8 in Mcore integration test on older GPUs
  • Remove data dependency in Mcore integration test

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Avoid FP8 on Ampere and older. Generate synthetic data instead of depending on external data.

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 added the bug, testing, and 1.14.0 labels on Dec 5, 2024
@timmoon10 (Collaborator, Author)

/te-ci pytorch L1

@sudhakarsingh27 (Collaborator) left a comment

Only minor nitpicks/questions. Otherwise, LGTM.

if [[ ${DEVICE_ARCH} -ge 89 ]]; then
WITH_FP8=1
fi

# Download Megatron-LM if needed
if [ ! -d "${MCORE_PATH}" ]; then
pushd $(dirname ${MCORE_PATH})
git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
sudhakarsingh27 (Collaborator)

I guess then we have to manually update this branch. Is this okay?

timmoon10 (Collaborator, Author)

I think there's an argument for just downloading the Megatron-LM main branch, but that is orthogonal to this PR.

printf "{" >> ${VOCAB_FILE}
printf "\"<|endoftext|>\": 0" >> ${VOCAB_FILE}
seq 1 4095 | awk '{ printf(", \"%d\": %d", $1, $1) }' >> ${VOCAB_FILE}
printf "}" >> ${VOCAB_FILE}
sudhakarsingh27 (Collaborator)

Wow, that's enough to generate a vocab file!

timmoon10 (Collaborator, Author) on Dec 6, 2024

This is somewhat specific to the Mcore mock GPT dataset: https://github.com/NVIDIA/Megatron-LM/blob/bd677bfb13ac2f19deaa927adc6da6f9201d66aa/megatron/core/datasets/gpt_dataset.py#L693
The model figures out the number of embeddings based on the vocab size. However, the mock dataset just generates indices with arange (up to a max sequence length of 4096), so the vocab can be junk.
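For reference, the printf/awk snippet above writes a single-line JSON object with 4096 entries ({"<|endoftext|>": 0, "1": 1, ..., "4095": 4095}), so the embedding table sized from the vocab covers every index the mock dataset's arange can emit. A quick sanity check, shown purely as an illustration and not part of the PR:

# Hypothetical check: count the entries in the generated vocab file
python -c "import json, sys; print(len(json.load(open(sys.argv[1]))))" ${VOCAB_FILE}
# expected output: 4096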

- --vocab-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-vocab.json
- --merge-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-merges.txt
+ --vocab-file ${VOCAB_FILE}
+ --merge-file ${TE_PATH}/qa/L1_pytorch_mcore_integration/merges.txt
sudhakarsingh27 (Collaborator)

Not sure what the purpose of this file is, but is it okay for it to only have version info?

timmoon10 (Collaborator, Author)

I'm treating merges.txt as a magic config for the tokenizer: huggingface/transformers#1083 (comment)
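For background (an assumption about the file's contents, not taken from the PR diff): a GPT-2-style BPE merges.txt starts with a version header and then lists one merge rule per line. A file containing only the header, for example:

#version: 0.2

describes a BPE tokenizer with no merges, which is harmless here because the mock dataset never tokenizes real text; only the vocab size matters.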

@timmoon10 merged commit d8b13cb into NVIDIA:main on Dec 6, 2024
29 checks passed
@timmoon10 deleted the debug-mcore-test branch on December 6, 2024, 05:42