Mixture of experts pretraining benchmark #780

Open · wants to merge 12 commits into master

Conversation

@ZhiyuLi-goog commented on Jan 6, 2025:

Description

Add the MoE (mixture of experts) pretraining benchmark to the mlcommons repo.

Todo list

TPU

  • Docker image verification
  • Run workload at small scale
  • Run workload at large scale

GPU

General

cc @suexu1025 @ShriyaPalsamudram

@ZhiyuLi-goog requested a review from a team as a code owner on January 6, 2025 10:57

github-actions bot commented Jan 6, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

ZhiyuLi-goog and others added 7 commits January 10, 2025 03:27
* fix(moe): Added weight decay parameter

* fix(moe): Added proper handling of device count per node

* refactor(moe): Data preprocessing cleanup

* fix(moe): This container has more stable convergence

* fix(gpu): data preprocessing

* build(gpu): Fix container image to specific version of NeMo and Megatron
@ZhiyuLi-goog (Author) commented:

Thank you @ShriyaPalsamudram for the review.

Could you help merge the PR when you think it is in good shape, since I don't have authorization?

The NeMo 2.0 GPU guides are not yet covered; we can probably add them later in a separate PR:

  • Update NeMo 2.0 GPU guides: @hXl3s, could you help update them? @JustinPan-goog, could you give the updated guides a try and review them? Thank you both!

@ZhiyuLi-goog changed the title from "[Draft] MoE Benchmark" to "MoE Benchmark" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "MoE Benchmark" to "mixture_of_experts_pretraining" on Jan 10, 2025
@ZhiyuLi-goog changed the title from "mixture_of_experts_pretraining" to "Mixture of experts pretraining benchmark" on Jan 10, 2025
@JustinPan-goog commented:

For sure, I will give the current GPU guide a try over the weekend!


* docs(moe): GPU running and slurm docs

* docs: Fixed markup
```
--output_dir <path to save checkpoint> --hf_token <your token to HF repository>
```

This script will download the specified checkpoint from the Hugging Face repository, preprocess it, and save it into the specified directory.
Contributor commented:

Is there a step to verify checksums of the converted checkpoint to ensure correctness?

Is this converted checkpoint available for download directly from the mlcommons drive? If yes, can those instructions be shared here as well?
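
To make the checksum suggestion concrete, here is a minimal sketch of what such a verification step could look like, assuming SHA-256 sums are published next to the converted checkpoint in a plain-text manifest. The manifest name `checksums.sha256` and the directory layout are assumptions for illustration, not part of this PR:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoint shards do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(ckpt_dir: str, manifest_name: str = "checksums.sha256") -> bool:
    """Compare '<hexdigest>  <relative path>' lines in the manifest against the files on disk."""
    root = Path(ckpt_dir)
    ok = True
    for line in (root / manifest_name).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines in the manifest
        expected, rel_path = line.split(maxsplit=1)
        actual = sha256_of(root / rel_path)
        if actual != expected:
            print(f"MISMATCH {rel_path}: expected {expected}, got {actual}")
            ok = False
    return ok


if __name__ == "__main__":
    # "converted_checkpoint" is a hypothetical output directory of the conversion script.
    print("checkpoint OK" if verify_checkpoint("converted_checkpoint") else "checkpoint corrupted")
```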


To preprocess the dataset, use the dataset_preprocessing.py script.
Contributor commented:

Same as checkpoint
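
For readers following along, here is a minimal sketch of the download step the docs describe, assuming the checkpoint is fetched with the public huggingface_hub API. Only the --output_dir and --hf_token flags come from the diff context above; the repo id is a placeholder, and the PR's actual script name and conversion logic are not reproduced here:

```python
# Illustrative only: a download step along the lines of what the PR's docs describe,
# using the public huggingface_hub API.
import argparse

from huggingface_hub import snapshot_download


def main() -> None:
    parser = argparse.ArgumentParser(description="Download a checkpoint from the Hugging Face Hub.")
    parser.add_argument("--output_dir", required=True, help="Path to save the checkpoint")
    parser.add_argument("--hf_token", required=True, help="Your token to the HF repository")
    # Placeholder: the benchmark's reference model repository is defined by the PR, not here.
    parser.add_argument("--repo_id", default="<hf-repo-id>", help="Repository to pull from")
    args = parser.parse_args()

    # Fetch every file of the repository snapshot into output_dir.
    snapshot_download(repo_id=args.repo_id, local_dir=args.output_dir, token=args.hf_token)
    # The benchmark's own script then preprocesses/converts the downloaded checkpoint;
    # that step is specific to this PR and is not reproduced here.


if __name__ == "__main__":
    main()
```

Any real run would follow this with the PR's conversion step and, ideally, the checksum verification discussed above.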
