
[QST] How to set MoE-specific TP size in recipe? #12103

Open
umiswing opened this issue Feb 8, 2025 · 7 comments

umiswing commented Feb 8, 2025

I want to set a MoE-specific TP size in my custom recipe. Megatron-Core provides this feature through expert-tensor-parallel-size, but I don't see any argument in NeMo's code for passing expert-tensor-parallel-size. Is there any way to set a MoE-specific TP size in the recipe directly, instead of hacking NeMo's calls to fake_initialize_model_parallel and initialize_model_parallel?
@akoumpa @gdengk

akoumpa (Member) commented Feb 9, 2025

Hi @umiswing, thank you for your question.

I'll answer in two parts:

  1. For expert-tensor-parallel-size, MegatronStrategy provides the expert_tensor_parallel_size argument, which you can use to specify the desired ETP. See also Gao's PR on ETP. I would therefore expect this to be usable in a recipe right away. Please let me know if that answers your question.
  2. ETP is already present in NeMo's top-of-tree code, so to use it in a recipe you just need to modify the corresponding value. For example, you can use something like the following:
from nemo.collections import llm

# Load the default Mixtral 8x7B pre-training recipe
recipe = llm.mixtral_8x7b.pretrain_recipe()

# Set the expert tensor parallel (ETP) size
recipe.trainer.strategy.expert_tensor_parallel_size = 4
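
If you are constructing the trainer yourself rather than going through a recipe, a minimal sketch of the same idea (the surrounding parallelism values here are purely illustrative, not a recommendation) is to pass it directly to the strategy:

from nemo import lightning as nl

# Illustrative only: set the MoE-specific TP size when building the strategy
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=8,
    expert_model_parallel_size=8,
    expert_tensor_parallel_size=1,  # MoE-specific TP size (ETP)
)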

Please let me know if that helps. Thank you.

umiswing (Author) commented Feb 10, 2025

@akoumpa, thanks for your reply!

Unfortunately, I still can't make it work. I get an Unexpected error with no detailed information when using the following parallelism settings:

tp_size=8, pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, sequence_parallel=True

Environment

I use the container nvcr.io/nvidia/nemo:24.12 and change NeMo, TransformerEngine, and Megatron-LM to the following commits:

NeMo commit: d2f6d7d
TransformerEngine commit: 7d576ed25266a17a7b651f2c12e8498f67e0baea
Megatron-LM tree: 0e85db539cf16816ffced6e7dac644d91ffadc04

Hardware: a single H800 node

My Recipe

The following is my modification to the mixtral_8x7b recipe; I run it with nemo llm pretrain --factory mixtral_8x7b on a single H800 node.

diff --git a/nemo/collections/llm/gpt/model/mixtral.py b/nemo/collections/llm/gpt/model/mixtral.py
index 5a0825087..12fc74ff6 100644
--- a/nemo/collections/llm/gpt/model/mixtral.py
+++ b/nemo/collections/llm/gpt/model/mixtral.py
@@ -100,7 +100,7 @@ class MixtralConfig8x7B(MixtralConfig):
     Official announcement: https://mistral.ai/news/mixtral-of-experts/
     """
 
-    num_layers: int = 32
+    num_layers: int = 8
     hidden_size: int = 4096
     ffn_hidden_size: int = 14336
     max_position_embeddings: int = 4096
diff --git a/nemo/collections/llm/recipes/mixtral_8x7b.py b/nemo/collections/llm/recipes/mixtral_8x7b.py
index 97a00b3d2..060f921db 100644
--- a/nemo/collections/llm/recipes/mixtral_8x7b.py
+++ b/nemo/collections/llm/recipes/mixtral_8x7b.py
@@ -33,6 +33,11 @@ from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectio
 from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
 from nemo.lightning.pytorch.callbacks.moe_token_drop import MegatronTokenDropCallback
 from nemo.utils.exp_manager import TimingCallback
+from nemo.lightning.pytorch.callbacks import ModelCheckpoint
+from nemo.lightning.pytorch.callbacks.nsys import NsysCallback
+from typing import Dict, List, Optional
+from nemo.collections.llm.recipes.precision.mixed_precision import bf16_with_fp8_mixed
+from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectionCallback
 
 NAME = "mixtral_8x7b"
 
@@ -120,18 +125,21 @@ def trainer(
         ),
     )
 
+    callbacks.append(run.Config(ModelCheckpoint, save_last=False))
+
     trainer = run.Config(
         nl.Trainer,
         accelerator="gpu",
         accumulate_grad_batches=1,
         callbacks=callbacks,
         devices=num_gpus_per_node,
-        limit_test_batches=50,
-        limit_val_batches=32,
-        log_every_n_steps=10,
+        limit_test_batches=0.0,
+        limit_val_batches=0.0,
+        log_every_n_steps=1,
         max_steps=max_steps,
         num_nodes=num_nodes,
-        plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+        # plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+        plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed", fp8="hybrid"),
         strategy=strategy,
         use_distributed_sampler=False,
         val_check_interval=2000,
@@ -139,6 +147,93 @@ def trainer(
 
     return trainer
 
+def get_comm_overlap_callback_idx(callbacks: List[Callback]):
+    """
+    nemo.lightning.Trainer has a list of callbacks defined. This method identifies index of MegatronCommOverlapCallback
+    from the list defined in recipes in nemo.collections.llm.recipes. The index is needed to override ddp communication
+    params
+    """
+    from nemo.collections.llm.recipes.llama3_8b import MegatronCommOverlapCallback as _MegatronCommOverlapCallback
+    if callbacks:  # default is None in lightning
+        for idx, callback in enumerate(callbacks):
+            if isinstance(callback, _MegatronCommOverlapCallback):
+                return idx
+    return -1
+
+def mixtral_8x7b_performance_recipe(
+    recipe: run.Partial,
+    compute_dtype: str,
+    num_nodes: int,
+    num_gpus_per_node: int,
+    mbs: int,
+    gbs: int,
+    tp_size: int,
+    pp_size: int,
+    cp_size: int,
+    vp_size: Optional[int],
+    ep_size: int,
+    etp_size: int,
+    max_steps: int,
+):
+    """
+    mixtral 8x7b pre-train recipe aimed at achieving best possible performance.
+
+    NOTE: Use fp8 precision training with caution. It might not give desirable results.
+    """
+    # recipe = pretrain_recipe(performance_mode=True)
+
+    # data module configs
+    recipe.data.micro_batch_size = mbs
+    recipe.data.global_batch_size = gbs
+    recipe.data.num_train_samples = max_steps * gbs * mbs  # ensure only 1 epoch for whole run
+    # recipe.data.tokenizer = hf_tokenizer("mistralai/Mixtral-8x7B-v0.1")
+
+    recipe.trainer.max_steps = max_steps
+    recipe.trainer.num_nodes = num_nodes
+    recipe.trainer.devices = num_gpus_per_node
+
+    # parallelism configs
+    recipe.trainer.strategy.tensor_model_parallel_size = tp_size
+    recipe.trainer.strategy.pipeline_model_parallel_size = pp_size
+    recipe.trainer.strategy.context_parallel_size = cp_size
+    recipe.trainer.strategy.virtual_pipeline_model_parallel_size = vp_size
+    recipe.trainer.strategy.expert_model_parallel_size = ep_size
+    recipe.trainer.strategy.expert_tensor_parallel_size = etp_size
+    if tp_size > 1:
+        recipe.trainer.strategy.sequence_parallel = True
+    else:
+        recipe.trainer.strategy.sequence_parallel = False
+
+    comm_overlap_callback_idx = get_comm_overlap_callback_idx(recipe.trainer.callbacks)
+
+    # compute dtype configs
+    if compute_dtype.lower() == "fp8":
+        recipe.trainer.plugins = bf16_with_fp8_mixed()
+    recipe.trainer.plugins.grad_reduce_in_fp32 = False  # bf16 grad dtype
+
+    # callback configs
+    garbage_collection_callback = run.Config(
+        GarbageCollectionCallback,
+        gc_interval_train=100,
+        gc_interval_val=500,
+    )
+    recipe.trainer.callbacks.extend(
+        [
+            garbage_collection_callback,
+        ]
+    )
+    dp_size = (num_nodes * num_gpus_per_node) / (tp_size * pp_size * cp_size)
+    if dp_size > 1 and pp_size > 1 and vp_size and vp_size > 1:
+        if comm_overlap_callback_idx >= 0:
+            recipe.trainer.callbacks[comm_overlap_callback_idx].overlap_param_gather_with_optimizer_step = True
+
+    # Misc. for overall faster experiment runtime
+    recipe.log.ckpt = None
+    # recipe.trainer.enable_checkpointing = False
+    recipe.trainer.val_check_interval = max_steps
+    recipe.trainer.log_every_n_steps = 1
+
+    return recipe
 
 @run.cli.factory(target=pretrain, name=NAME)
 def pretrain_recipe(
@@ -175,14 +270,19 @@ def pretrain_recipe(
             >>> recipe = pretrain_recipe(name="mixtral_8x7b_pretrain", num_nodes=8)
             >>> print(recipe)
     """
+
+    performance_mode = True
+    num_nodes = 1
+    num_gpus_per_node = 8
+
     recipe = run.Partial(
         fn,
         model=model(),
         trainer=trainer(
-            num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
+            pipeline_parallelism_type=torch.bfloat16, num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
         ),
-        data=run.Config(MockDataModule, seq_length=4096, global_batch_size=512, micro_batch_size=1),
-        log=default_log(dir=dir, name=name, tensorboard_logger=tensorboard_logger(name=name)),
+        data=run.Config(MockDataModule, seq_length=4096, global_batch_size=128, micro_batch_size=1),
+        log=run.Config(nl.NeMoLogger, log_dir="mixtral_8x7b_output"),
         optim=distributed_fused_adam_with_cosine_annealing(max_lr=3e-4),
         resume=default_resume(),
     )
@@ -190,6 +290,11 @@ def pretrain_recipe(
     if performance_mode:
         recipe = pretrain_performance_optimizations(recipe)
 
+    recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
+                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=8,
+                                             pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, max_steps=15)
+
+
     return recipe
 
 

umiswing (Author) commented

Hello @akoumpa, could you please let me know if there are any updates on this issue? Thanks!

akoumpa (Member) commented Feb 11, 2025

Hi @umiswing, thanks for checking in.

I’ve reached out internally for more information and am currently juggling a few other priorities, but I anticipate someone will have an update for you by the end of the week. I appreciate your patience and will be in touch soon.

suiyoubi (Collaborator) commented

Hi @umiswing, I am working with @akoumpa on this issue.

Would you mind sharing the detailed logs of the error you encountered?

Currently there is a bug on the Megatron-Core side when ETP != TP, which will likely be fixed soon.

Alternatively, could you try whether either of the following works?

recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=1,
                                             pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, max_steps=15)

# OR

recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
                                             num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=8,
                                             pp_size=1, cp_size=1, vp_size=1, ep_size=1, etp_size=8, max_steps=15)

umiswing (Author) commented

Hi @suiyoubi

Would you mind sharing the detailed logs of the error you encountered?

Unfortunately, NeMo reports no detailed logs; only an Unexpected error is raised.

Alternatively, could you try whether either of the following works?

Yes, both of the configs provided in your code snippets work.

suiyoubi (Collaborator) commented

OK, then this is a known issue on the Mcore side that we are fixing now.

For now, TP needs to be the same as ETP.
I will update you once it gets fixed in Mcore.
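
As an interim workaround, a minimal sketch that simply mirrors this constraint in a recipe (until the Mcore fix lands) is:

# Workaround sketch: keep ETP equal to TP until the Megatron-Core fix is released
recipe.trainer.strategy.expert_tensor_parallel_size = recipe.trainer.strategy.tensor_model_parallel_size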
