Disable FP8 in Mcore integration test on older GPUs #1357
Changes from all commits
A new two-line file (presumably ignore entries for the artifacts the test script now clones and generates):

```diff
@@ -0,0 +1,2 @@
+Megatron-LM
+vocab.json
```
A new one-line file, referenced below as ${TE_PATH}/qa/L1_pytorch_mcore_integration/merges.txt, containing only the BPE merges header:

```diff
@@ -0,0 +1 @@
+#version: 0.2
```
Changes to the Mcore integration test script:
```diff
@@ -8,13 +8,27 @@ set -e
 : ${TE_PATH:=/opt/transformerengine}
 : ${MCORE_PATH:=${TE_PATH}/qa/L1_pytorch_mcore_integration/Megatron-LM}
 
+# Check whether FP8 is supported
+DEVICE_ARCH=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1 | sed 's/[^0-9]//g')
+if [[ ${DEVICE_ARCH} -ge 89 ]]; then
+    WITH_FP8=1
+fi
```
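For context, a rough sketch of what this check sees on a few common GPUs (illustrative values, not taken from the actual CI runners):

```bash
# compute_cap -> DEVICE_ARCH -> FP8 decision (examples):
#   H100 (Hopper): "9.0" -> 90 -> WITH_FP8=1
#   L40  (Ada):    "8.9" -> 89 -> WITH_FP8=1
#   A100 (Ampere): "8.0" -> 80 -> WITH_FP8 stays unset, FP8 flags omitted
nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1 | sed 's/[^0-9]//g'
```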
```diff
+
 # Download Megatron-LM if needed
 if [ ! -d "${MCORE_PATH}" ]; then
     pushd $(dirname ${MCORE_PATH})
     git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
     popd
 fi
 
+# Create mock vocab
+VOCAB_FILE=${TE_PATH}/qa/L1_pytorch_mcore_integration/vocab.json
+printf "" > ${VOCAB_FILE}
+printf "{" >> ${VOCAB_FILE}
+printf "\"<|endoftext|>\": 0" >> ${VOCAB_FILE}
+seq 1 4095 | awk '{ printf(", \"%d\": %d", $1, $1) }' >> ${VOCAB_FILE}
+printf "}" >> ${VOCAB_FILE}
```
Comment: Wow, that's enough to generate a vocab file!

Reply: This is somewhat specific to the Mcore mock GPT dataset: https://github.com/NVIDIA/Megatron-LM/blob/bd677bfb13ac2f19deaa927adc6da6f9201d66aa/megatron/core/datasets/gpt_dataset.py#L693
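For context, the vocab generation above writes a single-line JSON object mapping 4096 token strings to ids, {"<|endoftext|>": 0, "1": 1, ..., "4095": 4095}. A hedged way to sanity-check the result (jq is an assumption here; neither the test nor Megatron-LM uses it):

```bash
# Inspect the generated mock vocab (expected values shown in comments).
jq 'length' vocab.json              # 4096 entries
jq '.["<|endoftext|>"]' vocab.json  # 0
head -c 60 vocab.json               # {"<|endoftext|>": 0, "1": 1, "2": 2, ...
```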
```diff
+
 # Megatron-LM invocation
 COMMAND="
 NVTE_TORCH_COMPILE=0
```
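As a side note on the shell pattern used here: COMMAND is built up as a multi-line string and collapsed to a single line near the end of the script with tr. A minimal sketch, with python pretrain_gpt.py standing in for the real launcher line, which is not shown in this hunk:

```bash
# Build a multi-line command string, then replace newlines with spaces.
COMMAND="
NVTE_TORCH_COMPILE=0
python pretrain_gpt.py
--seq-length 128
"
COMMAND=$(echo "${COMMAND}" | tr '\n' ' ')
echo "${COMMAND}"   # now a single line, ready to hand to one shell invocation
```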
```diff
@@ -40,17 +54,17 @@ ${MCORE_PATH}/pretrain_gpt.py
 --hidden-size 128
 --num-attention-heads 8
 --seq-length 128
---max-position-embeddings 2048
+--max-position-embeddings 128
 --micro-batch-size 1
 --global-batch-size 8
 --train-iters 10
 --eval-iters 10
 --lr 1e-4
 --mock-data
---vocab-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-vocab.json
---merge-file /data/gpt3/pile-cc1-cc2-shuf/bpe/gpt2-merges.txt
+--vocab-file ${VOCAB_FILE}
+--merge-file ${TE_PATH}/qa/L1_pytorch_mcore_integration/merges.txt
```
Comment: Not sure what the purpose of this file is, but is it okay for it to only have version info?

Reply: I'm treating …
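For reference, a full GPT-2 style BPE merges file lists ranked merge rules after the version header; the file added in this PR stops at the header, which appears to be enough here since --mock-data means no real corpus is tokenized (an inference, not something stated in the thread). A sketch:

```bash
# Illustrative first lines of a real GPT-2 merges.txt (not taken from the
# pile-cc1-cc2-shuf tokenizer referenced in the old hardcoded paths):
#   #version: 0.2
#   Ġ t
#   Ġ a
#   h e
# The checked-in file defines zero merge rules:
cat ${TE_PATH}/qa/L1_pytorch_mcore_integration/merges.txt   # -> #version: 0.2
```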
```diff
 --transformer-impl transformer_engine
---fp8-format hybrid
+${WITH_FP8:+--fp8-format hybrid}
 "
 COMMAND=$(echo "${COMMAND}" | tr '\n' ' ')
 
```
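The gating relies on bash's ${WITH_FP8:+word} expansion, which produces word only when the variable is set and non-empty. A minimal sketch outside the test script:

```bash
unset WITH_FP8
echo "flags: [${WITH_FP8:+--fp8-format hybrid}]"   # flags: []
WITH_FP8=1
echo "flags: [${WITH_FP8:+--fp8-format hybrid}]"   # flags: [--fp8-format hybrid]
```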
Comment: I guess then we have to manually update this branch. Is this okay?

Reply: I think there's an argument for just downloading the Megatron-LM main branch, but that is orthogonal to this PR.
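To illustrate the point under discussion, a condensed sketch of the existing download logic (not a proposed change): the clone runs only when the checkout is missing and the branch is pinned, so picking up a newer Megatron-LM means deleting the directory and editing core_r0.9.0 by hand.

```bash
MCORE_PATH=${MCORE_PATH:-/opt/transformerengine/qa/L1_pytorch_mcore_integration/Megatron-LM}
if [ ! -d "${MCORE_PATH}" ]; then
    # Pinned branch; updating it later is a manual step.
    git clone -b core_r0.9.0 https://github.com/NVIDIA/Megatron-LM.git "${MCORE_PATH}"
fi
```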