
[QST] How to set MoE-specific TP size in recipe? #12103

Open
umiswing opened this issue Feb 8, 2025 · 7 comments

umiswing commented Feb 8, 2025

I want to set a MoE-specific TP size in my custom recipe. Megatron-Core provides this feature through expert-tensor-parallel-size, but I don't see any argument in NeMo's code for passing expert-tensor-parallel-size. Is there any way to set a MoE-specific TP size in the recipe directly, instead of hacking NeMo's calls to fake_initialize_model_parallel and initialize_model_parallel?
@akoumpa @gdengk

akoumpa (Member) commented Feb 9, 2025

Hi @umiswing, thank you for your question.

I'll answer in two parts:

  1. For expert-tensor-parallel-size, MegatronStrategy provides the expert_tensor_parallel_size argument, which you can use to specify the desired ETP. See also Gao's PR on ETP. I would therefore expect this to be usable in a recipe right away. Please let me know if that answers your question.
  2. ETP is already present in NeMo's top-of-tree code, so to use it in a recipe you just need to modify the corresponding value. For example, you can use something like the following:
from nemo.collections import llm

# Load the default Mixtral 8x7B pre-training recipe
recipe = llm.mixtral_8x7b.pretrain_recipe()

# Set the expert tensor parallel (ETP) size
recipe.trainer.strategy.expert_tensor_parallel_size = 4
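
If you are constructing the trainer yourself rather than going through a recipe, a minimal sketch of the same idea (the surrounding parallelism values here are purely illustrative, not a recommendation) is to pass it directly to the strategy:

from nemo import lightning as nl

# Illustrative only: set the MoE-specific TP size when building the strategy
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=8,
    expert_model_parallel_size=8,
    expert_tensor_parallel_size=1,  # MoE-specific TP size (ETP)
)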

Please let me know if that helps. Thank you.

umiswing (Author) commented Feb 10, 2025

@akoumpa, thanks for your reply!

Unfortunately, I still can't make it work. I get an Unexpected error with no detailed information when using the following parallelism settings:

tp_size=8, pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, sequence_parallel=True

Environment

I use the container nvcr.io/nvidia/nemo:24.12 and change NeMo, TransformerEngine, and Megatron-LM to the following commits:

NeMo commit: d2f6d7d
TransformerEngine commit: 7d576ed25266a17a7b651f2c12e8498f67e0baea
Megatron-LM tree: 0e85db539cf16816ffced6e7dac644d91ffadc04

Hardware: a single H800 node

My Recipe

The following is my modification to the mixtral_8x7b recipe; I run it with nemo llm pretrain --factory mixtral_8x7b on a single H800 node.

diff --git a/nemo/collections/llm/gpt/model/mixtral.py b/nemo/collections/llm/gpt/model/mixtral.py
index 5a0825087..12fc74ff6 100644
--- a/nemo/collections/llm/gpt/model/mixtral.py
+++ b/nemo/collections/llm/gpt/model/mixtral.py
@@ -100,7 +100,7 @@ class MixtralConfig8x7B(MixtralConfig):
     Official announcement: https://mistral.ai/news/mixtral-of-experts/
     """
 
-    num_layers: int = 32
+    num_layers: int = 8
     hidden_size: int = 4096
     ffn_hidden_size: int = 14336
     max_position_embeddings: int = 4096
diff --git a/nemo/collections/llm/recipes/mixtral_8x7b.py b/nemo/collections/llm/recipes/mixtral_8x7b.py
index 97a00b3d2..060f921db 100644
--- a/nemo/collections/llm/recipes/mixtral_8x7b.py
+++ b/nemo/collections/llm/recipes/mixtral_8x7b.py
@@ -33,6 +33,11 @@ from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectio
 from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
 from nemo.lightning.pytorch.callbacks.moe_token_drop import MegatronTokenDropCallback
 from nemo.utils.exp_manager import TimingCallback
+from nemo.lightning.pytorch.callbacks import ModelCheckpoint
+from nemo.lightning.pytorch.callbacks.nsys import NsysCallback
+from typing import Dict, List, Optional
+from nemo.collections.llm.recipes.precision.mixed_precision import bf16_with_fp8_mixed
+from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectionCallback
 
 NAME = "mixtral_8x7b"
 
@@ -120,18 +125,21 @@ def trainer(
         ),
     )
 
+    callbacks.append(run.Config(ModelCheckpoint, save_last=False))
+
     trainer = run.Config(
         nl.Trainer,
         accelerator="gpu",
         accumulate_grad_batches=1,
         callbacks=callbacks,
         devices=num_gpus_per_node,
-        limit_test_batches=50,
-        limit_val_batches=32,
-        log_every_n_steps=10,
+        limit_test_batches=0.0,
+        limit_val_batches=0.0,
+        log_every_n_steps=1,
         max_steps=max_steps,
         num_nodes=num_nodes,
-        plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+        # plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+        plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed", fp8="hybrid"),
         strategy=strategy,
         use_distributed_sampler=False,
         val_check_interval=2000,
@@ -139,6 +147,93 @@ def trainer(
 
     return trainer
 
+def get_comm_overlap_callback_idx(callbacks: List[Callback]):
+    """
+    nemo.lightning.Trainer has a list of callbacks defined. This method identifies index of MegatronCommOverlapCallback
+    from the list defined in recipes in nemo.collections.llm.recipes. The index is needed to override ddp communication
+    params
+    """
+    from nemo.collections.llm.recipes.llama3_8b import MegatronCommOverlapCallback as _MegatronCommOverlapCallback
+    if callbacks:  # default is None in lightning
+        for idx, callback in enumerate(callbacks):
+            if isinstance(callback, _MegatronCommOverlapCallback):
+                return idx
+    return -1
+
+def mixtral_8x7b_performance_recipe(
+    recipe: run.Partial,
+    compute_dtype: str,
+    num_nodes: int,
+    num_gpus_per_node: int,
+    mbs: int,
+    gbs: int,
+    tp_size: int,
+    pp_size: int,
+    cp_size: int,
+    vp_size: Optional[int],
+    ep_size: int,
+    etp_size: int,
+    max_steps: int,
+):
+    """
+    mixtral 8x7b pre-train recipe aimed at achieving best possible performance.
+
+    NOTE: Use fp8 precision training with caution. It might not give desirable results.
+    """
+    # recipe = pretrain_recipe(performance_mode=True)
+
+    # data module configs
+    recipe.data.micro_batch_size = mbs
+    recipe.data.global_batch_size = gbs
+    recipe.data.num_train_samples = max_steps * gbs * mbs  # ensure only 1 epoch for whole run
+    # recipe.data.tokenizer = hf_tokenizer("mistralai/Mixtral-8x7B-v0.1")
+
+    recipe.trainer.max_steps = max_steps
+    recipe.trainer.num_nodes = num_nodes
+    recipe.trainer.devices = num_gpus_per_node
+
+    # parallelism configs
+    recipe.trainer.strategy.tensor_model_parallel_size = tp_size
+    recipe.trainer.strategy.pipeline_model_parallel_size = pp_size
+    recipe.trainer.strategy.context_parallel_size = cp_size
+    recipe.trainer.strategy.virtual_pipeline_model_parallel_size = vp_size
+    recipe.trainer.strategy.expert_model_parallel_size = ep_size
+    recipe.trainer.strategy.expert_tensor_parallel_size = etp_size
+    if tp_size > 1:
+        recipe.trainer.strategy.sequence_parallel = True
+    else:
+        recipe.trainer.strategy.sequence_parallel = False
+
+    comm_overlap_callback_idx = get_comm_overlap_callback_idx(recipe.trainer.callbacks)
+
+    # compute dtype configs
+    if compute_dtype.lower() == "fp8":
+        recipe.trainer.plugins = bf16_with_fp8_mixed()
+    recipe.trainer.plugins.grad_reduce_in_fp32 = False  # bf16 grad dtype
+
+    # callback configs
+    garbage_collection_callback = run.Config(
+        GarbageCollectionCallback,
+        gc_interval_train=100,
+        gc_interval_val=500,
+    )
+    recipe.trainer.callbacks.extend(
+        [
+            garbage_collection_callback,
+        ]
+    )
+    dp_size = (num_nodes * num_gpus_per_node) / (tp_size * pp_size * cp_size)
+    if dp_size > 1 and pp_size > 1 and vp_size and vp_size > 1:
+        if comm_overlap_callback_idx >= 0:
+            recipe.trainer.callbacks[comm_overlap_callback_idx].overlap_param_gather_with_optimizer_step = True
+
+    # Misc. for overall faster experiment runtime
+    recipe.log.ckpt = None
+    # recipe.trainer.enable_checkpointing = False
+    recipe.trainer.val_check_interval = max_steps
+    recipe.trainer.log_every_n_steps = 1
+
+    return recipe
 
 @run.cli.factory(target=pretrain, name=NAME)
 def pretrain_recipe(
@@ -175,14 +270,19 @@ def pretrain_recipe(
             >>> recipe = pretrain_recipe(name="mixtral_8x7b_pretrain", num_nodes=8)
             >>> print(recipe)
     """
+
+    performance_mode = True
+    num_nodes = 1
+    num_gpus_per_node = 8
+
     recipe = run.Partial(
         fn,
         model=model(),
         trainer=trainer(
-            num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
+            pipeline_parallelism_type=torch.bfloat16, num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
         ),
-        data=run.Config(MockDataModule, seq_length=4096, global_batch_size=512, micro_batch_size=1),
-        log=default_log(dir=dir, name=name, tensorboard_logger=tensorboard_logger(name=name)),
+        data=run.Config(MockDataModule, seq_length=4096, global_batch_size=128, micro_batch_size=1),
+        log=run.Config(nl.NeMoLogger, log_dir="mixtral_8x7b_output"),
         optim=distributed_fused_adam_with_cosine_annealing(max_lr=3e-4),
         resume=default_resume(),
     )
@@ -190,6 +290,11 @@ def pretrain_recipe(
     if performance_mode:
         recipe = pretrain_performance_optimizations(recipe)
 
+    recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
+                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=8,
+                                             pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, max_steps=15)
+
+
     return recipe
 
 

umiswing (Author) commented

Hello @akoumpa, could you please let me know if there are any updates on this issue? Thanks!

akoumpa (Member) commented Feb 11, 2025

Hi @umiswing, thanks for checking in.

I’ve reached out internally for more information and am currently juggling a few other priorities, but I anticipate someone will have an update for you by the end of the week. I appreciate your patience and will be in touch soon.

suiyoubi (Collaborator) commented

Hi @umiswing, I am working with @akoumpa on this issue.

Would you mind sharing the detailed logs of the error you encountered?

Currently there is a bug on the Megatron-Core side when ETP != TP, which will likely be fixed soon.

Alternatively, could you try whether either of the following works?

recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=1,
                                             pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, max_steps=15)

# OR

recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=8,
                                             pp_size=1, cp_size=1, vp_size=1, ep_size=1, etp_size=8, max_steps=15)

umiswing (Author) commented

Hi @suiyoubi

Would you mind sharing the detailed logs of the error you encountered?

Unfortunately, NeMo reports no detailed logs; only an Unexpected error is raised.

Alternatively, could you try whether either of the following works?

Yes, both of the configs provided in your code snippets work.

suiyoubi (Collaborator) commented

OK, then this is a known issue on the Mcore side that we are fixing now.

For now, TP needs to be the same as ETP.
I will update you once it gets fixed in Mcore.
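
As an interim workaround, a minimal sketch that simply mirrors this constraint in a recipe (until the Mcore fix lands) is:

# Workaround sketch: keep ETP equal to TP until the Megatron-Core fix is released
recipe.trainer.strategy.expert_tensor_parallel_size = recipe.trainer.strategy.tensor_model_parallel_size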
