Mixture of experts pretraining benchmark #780
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
* fix(moe): Added weight decay parameter
* fix(moe): Added proper handling of device count per node
* refactor(moe): Data preprocessing cleanup
* fix(moe): This container has more stable convergence
* fix(gpu): data preprocessing
* build(gpu): Fix container image to specific version of NeMo and Megatron
Thank you @ShriyaPalsamudram for the review. Could you help me merge the PR when you think it is in good shape, since I don't have merge authorization? The NeMo 2.0 GPU guides are not covered yet; we can probably add them later in a separate PR:
For sure, I will give the current GPU guide a try over the weekend!
* docs(moe): GPU running and slurm docs
* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download specified checkpoint from huggingface repository, preprocess it and save
Is there a step to verify checksums of the converted checkpoint to ensure correctness?
Is this converted checkpoint available for download directly from the MLCommons drive? If so, could those instructions be shared here as well?
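To make the checksum question concrete, here is a rough sketch of what such a verification step could look like. It is an illustration only and not part of this PR: the helper names `sha256_of_file`, `checksum_tree`, and `find_mismatches` are hypothetical, and it assumes a reference set of checksums for the converted checkpoint would be published somewhere, which the PR does not state.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_mismatches(actual, expected):
    """Return sorted paths that are missing, unexpected, or differ in digest."""
    return sorted(
        name
        for name in set(actual) | set(expected)
        if actual.get(name) != expected.get(name)
    )
```

Running `checksum_tree` on the output directory and comparing the result against a published reference with `find_mismatches` would flag any file that is absent, extra, or different. This only works if the conversion is byte-for-byte deterministic, which would need to be confirmed first.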
This script will download specified checkpoint from huggingface repository, preprocess it and save
into specified directory

To preprocess dataset, use dataset_preprocessing.py script
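Same questions as for the checkpoint: is there a checksum verification step, and is the preprocessed dataset available for direct download?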
Description
Add MoE benchmark to mlcommons repo.
todo list
TPU
GPU
General
cc @suexu1025 @ShriyaPalsamudram