Mixture of experts pretraining benchmark #780

Open · wants to merge 12 commits into master

Conversation

@ZhiyuLi-goog commented on Jan 6, 2025:

Description

Add the MoE (mixture of experts) pretraining benchmark to the mlcommons repo.

Todo list

TPU

  • Docker image verification
  • Run workload at small scale
  • Run workload at large scale

GPU

General

cc @suexu1025 @ShriyaPalsamudram

@ZhiyuLi-goog requested a review from a team as a code owner on January 6, 2025 10:57

github-actions bot commented Jan 6, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ZhiyuLi-goog and others added 7 commits January 10, 2025 03:27
* fix(moe): Added weight decay parameter

* fix(moe): Added proper handling of device count per node

* refactor(moe): Data preprocessing cleanup

* fix(moe): This container has more stable convergence

* fix(gpu): data preprocessing

* build(gpu): Fix container image to specific version of NeMo and Megatron
@ZhiyuLi-goog (Author) commented:

Thank you @ShriyaPalsamudram for the review.

Could you help merge the PR when you think it is in good shape, since I don't have authorization?

The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR:

  • Update NeMo 2.0 GPU guides: @hXl3s, could you help update them? @JustinPan-goog, could you give the updated guides a try and review them? Thank you both!

@ZhiyuLi-goog changed the title from "[Draft] MoE Benchmark" to "MoE Benchmark" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "MoE Benchmark" to "mixture_of_experts_pretraining" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "mixture_of_experts_pretraining" to "Mixture of experts pretraining benchmark" on Jan 10, 2025
@JustinPan-goog commented:

For sure, I will give the current GPU guide a try over the weekend!


* docs(moe): GPU running and slurm docs

* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.
Contributor commented:

Is there a step to verify checksums of the converted checkpoint to ensure correctness?

Is this converted checkpoint available for download directly from the mlcommons drive? If yes, can those instructions be shared here as well?
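
To make the checksum suggestion concrete, here is a minimal sketch of what such a verification step could look like, assuming SHA-256 sums are published next to the converted checkpoint in a plain-text manifest. The manifest name `checksums.sha256` and the directory layout are assumptions for illustration, not part of this PR:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoint shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(ckpt_dir: str, manifest_name: str = "checksums.sha256") -> bool:
    """Compare '<hexdigest>  <relative path>' lines in the manifest against the files on disk."""
    root = Path(ckpt_dir)
    ok = True
    for line in (root / manifest_name).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines in the manifest
        expected, rel_path = line.split(maxsplit=1)
        actual = sha256_of(root / rel_path)
        if actual != expected:
            print(f"MISMATCH {rel_path}: expected {expected}, got {actual}")
            ok = False
    return ok


if __name__ == "__main__":
    # "converted_checkpoint" is a hypothetical output directory of the conversion script.
    print("checkpoint OK" if verify_checkpoint("converted_checkpoint") else "checkpoint corrupted")
```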


To preprocess the dataset, use the dataset_preprocessing.py script.
Contributor commented:

Same as checkpoint
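
For readers following along, here is a minimal sketch of the download step the docs describe, assuming the checkpoint is fetched with the public huggingface_hub API. Only the --output_dir and --hf_token flags come from the diff context above; the repo id is a placeholder, and the PR's actual script name and conversion logic are not reproduced here:

```python
# Illustrative only: a download step along the lines of what the PR's docs describe,
# using the public huggingface_hub API.
import argparse

from huggingface_hub import snapshot_download


def main() -> None:
    parser = argparse.ArgumentParser(description="Download a checkpoint from the Hugging Face Hub.")
    parser.add_argument("--output_dir", required=True, help="Path to save the checkpoint")
    parser.add_argument("--hf_token", required=True, help="Your token to the HF repository")
    # Placeholder: the benchmark's reference model repository is defined by the PR, not here.
    parser.add_argument("--repo_id", default="<hf-repo-id>", help="Repository to pull from")
    args = parser.parse_args()

    # Fetch every file of the repository snapshot into output_dir.
    snapshot_download(repo_id=args.repo_id, local_dir=args.output_dir, token=args.hf_token)
    # The benchmark's own script then preprocesses/converts the downloaded checkpoint;
    # that step is specific to this PR and is not reproduced here.


if __name__ == "__main__":
    main()
```

Any real run would follow this with the PR's conversion step and, ideally, the checksum verification discussed above.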
