[QST] How to set MoE-specific TP size in recipe? #12103
Hi @umiswing, thank you for your question. I'll answer in two parts:
Please let me know if that helps. Thank you.
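For context, the recipe-level approach that the rest of this thread converges on boils down to a few strategy attributes. Below is a minimal sketch; the attribute names (`expert_model_parallel_size`, `expert_tensor_parallel_size` on the trainer strategy) are taken from the diff later in this thread and should be read as illustrative, not as official API documentation.

```python
# Minimal sketch (attribute names assumed from the recipe diff below in this thread):
# set the MoE-specific TP size on the recipe's strategy alongside the usual
# dense-model parallelism knobs.
from nemo.collections.llm.recipes import mixtral_8x7b

recipe = mixtral_8x7b.pretrain_recipe(name="mixtral_8x7b_moe_tp", num_nodes=1, num_gpus_per_node=8)

# Dense (attention / MLP) parallelism
recipe.trainer.strategy.tensor_model_parallel_size = 2
recipe.trainer.strategy.pipeline_model_parallel_size = 1

# MoE-specific parallelism: expert parallelism plus an expert tensor-parallel size,
# the knob this issue asks about. Per the end of this thread, keep it equal to the
# dense TP size until the Megatron-Core ETP != TP bug is fixed.
recipe.trainer.strategy.expert_model_parallel_size = 4
recipe.trainer.strategy.expert_tensor_parallel_size = 2
```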
@akoumpa, thanks for your reply! Unfortunately, I still can't make it work. I get

Environment
I use a container. NeMo commit: d2f6d7d. Hardware: single node with H800 GPUs.

My Recipe
Following is my modification to mixtral_8x7b's recipe, and I run it with

diff --git a/nemo/collections/llm/gpt/model/mixtral.py b/nemo/collections/llm/gpt/model/mixtral.py
index 5a0825087..12fc74ff6 100644
--- a/nemo/collections/llm/gpt/model/mixtral.py
+++ b/nemo/collections/llm/gpt/model/mixtral.py
@@ -100,7 +100,7 @@ class MixtralConfig8x7B(MixtralConfig):
Official announcement: https://mistral.ai/news/mixtral-of-experts/
"""
- num_layers: int = 32
+ num_layers: int = 8
hidden_size: int = 4096
ffn_hidden_size: int = 14336
max_position_embeddings: int = 4096
diff --git a/nemo/collections/llm/recipes/mixtral_8x7b.py b/nemo/collections/llm/recipes/mixtral_8x7b.py
index 97a00b3d2..060f921db 100644
--- a/nemo/collections/llm/recipes/mixtral_8x7b.py
+++ b/nemo/collections/llm/recipes/mixtral_8x7b.py
@@ -33,6 +33,11 @@ from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectio
from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
from nemo.lightning.pytorch.callbacks.moe_token_drop import MegatronTokenDropCallback
from nemo.utils.exp_manager import TimingCallback
+from nemo.lightning.pytorch.callbacks import ModelCheckpoint
+from nemo.lightning.pytorch.callbacks.nsys import NsysCallback
+from typing import Dict, List, Optional
+from nemo.collections.llm.recipes.precision.mixed_precision import bf16_with_fp8_mixed
+from nemo.lightning.pytorch.callbacks.garbage_collection import GarbageCollectionCallback
NAME = "mixtral_8x7b"
@@ -120,18 +125,21 @@ def trainer(
),
)
+ callbacks.append(run.Config(ModelCheckpoint, save_last=False))
+
trainer = run.Config(
nl.Trainer,
accelerator="gpu",
accumulate_grad_batches=1,
callbacks=callbacks,
devices=num_gpus_per_node,
- limit_test_batches=50,
- limit_val_batches=32,
- log_every_n_steps=10,
+ limit_test_batches=0.0,
+ limit_val_batches=0.0,
+ log_every_n_steps=1,
max_steps=max_steps,
num_nodes=num_nodes,
- plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+ # plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed"),
+ plugins=run.Config(nl.MegatronMixedPrecision, precision="bf16-mixed", fp8="hybrid"),
strategy=strategy,
use_distributed_sampler=False,
val_check_interval=2000,
@@ -139,6 +147,93 @@ def trainer(
return trainer
+def get_comm_overlap_callback_idx(callbacks: List[Callback]):
+ """
+ nemo.lightning.Trainer has a list of callbacks defined. This method identifies the index of MegatronCommOverlapCallback
+ in the list defined in the recipes under nemo.collections.llm.recipes. The index is needed to override DDP communication
+ params.
+ """
+ from nemo.collections.llm.recipes.llama3_8b import MegatronCommOverlapCallback as _MegatronCommOverlapCallback
+ if callbacks: # default is None in lightning
+ for idx, callback in enumerate(callbacks):
+ if isinstance(callback, _MegatronCommOverlapCallback):
+ return idx
+ return -1
+
+def mixtral_8x7b_performance_recipe(
+ recipe: run.Partial,
+ compute_dtype: str,
+ num_nodes: int,
+ num_gpus_per_node: int,
+ mbs: int,
+ gbs: int,
+ tp_size: int,
+ pp_size: int,
+ cp_size: int,
+ vp_size: Optional[int],
+ ep_size: int,
+ etp_size: int,
+ max_steps: int,
+):
+ """
+ mixtral 8x7b pre-train recipe aimed at achieving best possible performance.
+
+ NOTE: Use fp8 precision training with caution. It might not give desirable results.
+ """
+ # recipe = pretrain_recipe(performance_mode=True)
+
+ # data module configs
+ recipe.data.micro_batch_size = mbs
+ recipe.data.global_batch_size = gbs
+ recipe.data.num_train_samples = max_steps * gbs * mbs # ensure only 1 epoch for whole run
+ # recipe.data.tokenizer = hf_tokenizer("mistralai/Mixtral-8x7B-v0.1")
+
+ recipe.trainer.max_steps = max_steps
+ recipe.trainer.num_nodes = num_nodes
+ recipe.trainer.devices = num_gpus_per_node
+
+ # parallelism configs
+ recipe.trainer.strategy.tensor_model_parallel_size = tp_size
+ recipe.trainer.strategy.pipeline_model_parallel_size = pp_size
+ recipe.trainer.strategy.context_parallel_size = cp_size
+ recipe.trainer.strategy.virtual_pipeline_model_parallel_size = vp_size
+ recipe.trainer.strategy.expert_model_parallel_size = ep_size
+ recipe.trainer.strategy.expert_tensor_parallel_size = etp_size
+ if tp_size > 1:
+ recipe.trainer.strategy.sequence_parallel = True
+ else:
+ recipe.trainer.strategy.sequence_parallel = False
+
+ comm_overlap_callback_idx = get_comm_overlap_callback_idx(recipe.trainer.callbacks)
+
+ # compute dtype configs
+ if compute_dtype.lower() == "fp8":
+ recipe.trainer.plugins = bf16_with_fp8_mixed()
+ recipe.trainer.plugins.grad_reduce_in_fp32 = False # bf16 grad dtype
+
+ # callback configs
+ garbage_collection_callback = run.Config(
+ GarbageCollectionCallback,
+ gc_interval_train=100,
+ gc_interval_val=500,
+ )
+ recipe.trainer.callbacks.extend(
+ [
+ garbage_collection_callback,
+ ]
+ )
+ dp_size = (num_nodes * num_gpus_per_node) / (tp_size * pp_size * cp_size)
+ if dp_size > 1 and pp_size > 1 and vp_size and vp_size > 1:
+ if comm_overlap_callback_idx >= 0:
+ recipe.trainer.callbacks[comm_overlap_callback_idx].overlap_param_gather_with_optimizer_step = True
+
+ # Misc. for overall faster experiment runtime
+ recipe.log.ckpt = None
+ # recipe.trainer.enable_checkpointing = False
+ recipe.trainer.val_check_interval = max_steps
+ recipe.trainer.log_every_n_steps = 1
+
+ return recipe
@run.cli.factory(target=pretrain, name=NAME)
def pretrain_recipe(
@@ -175,14 +270,19 @@ def pretrain_recipe(
>>> recipe = pretrain_recipe(name="mixtral_8x7b_pretrain", num_nodes=8)
>>> print(recipe)
"""
+
+ performance_mode = True
+ num_nodes = 1
+ num_gpus_per_node = 8
+
recipe = run.Partial(
fn,
model=model(),
trainer=trainer(
- num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
+ pipeline_parallelism_type=torch.bfloat16, num_nodes=num_nodes, num_gpus_per_node=num_gpus_per_node, callbacks=[run.Config(TimingCallback)]
),
- data=run.Config(MockDataModule, seq_length=4096, global_batch_size=512, micro_batch_size=1),
- log=default_log(dir=dir, name=name, tensorboard_logger=tensorboard_logger(name=name)),
+ data=run.Config(MockDataModule, seq_length=4096, global_batch_size=128, micro_batch_size=1),
+ log=run.Config(nl.NeMoLogger, log_dir="mixtral_8x7b_output"),
optim=distributed_fused_adam_with_cosine_annealing(max_lr=3e-4),
resume=default_resume(),
)
@@ -190,6 +290,11 @@ def pretrain_recipe(
if performance_mode:
recipe = pretrain_performance_optimizations(recipe)
+ recipe = mixtral_8x7b_performance_recipe(recipe=recipe, compute_dtype="fp8", num_nodes=num_nodes,
+ num_gpus_per_node=num_gpus_per_node, mbs=1, gbs=128, tp_size=8,
+ pp_size=1, cp_size=1, vp_size=1, ep_size=8, etp_size=1, max_steps=15)
+
+
return recipe
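The exact launch command isn't shown in the comment above; as an assumption based on the usual NeMo-Run pattern, a recipe like this could be run locally on a single 8-GPU node roughly as follows (the executor settings are illustrative, not the poster's actual setup).

```python
# Sketch only, not the poster's actual command: launch the recipe with
# NeMo-Run's local, torchrun-based executor on one node with 8 GPUs.
import nemo_run as run
from nemo.collections.llm.recipes import mixtral_8x7b

if __name__ == "__main__":
    recipe = mixtral_8x7b.pretrain_recipe(name="mixtral_8x7b_pretrain", num_nodes=1, num_gpus_per_node=8)
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor, name="mixtral_8x7b_pretrain")
```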
Hello @akoumpa, could you please let me know if there are any updates on this issue? Thanks!
Hi @umiswing, thanks for checking in. I’ve reached out internally for more information and am currently juggling a few other priorities, but I anticipate someone will have an update for you by the end of the week. I appreciate your patience and will be in touch soon.
Hi @umiswing, I am working with @akoumpa on this issue. Would you mind sharing the detailed logs of the error you encountered? Currently there is a bug on the Megatron-Core side when you have ETP != TP, which will likely be fixed soon. Alternatively, could you try whether the following works?
Hi @suiyoubi,
Unfortunately, NeMo reports no detailed logs, only
Yes, both of the configs provided in your code snippets work.
OK, then this is a known issue on the Mcore side that we are fixing now. For now, TP needs to be the same as ETP.
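To make the constraint concrete, here is a small sanity-check sketch. It assumes (from my reading of Megatron-Core's MoE parallelism, not from this thread) that the non-pipeline ranks are regrouped for MoE layers so that TP * CP * DP == ETP * EP * expert-DP; the helper name is hypothetical.

```python
# Hypothetical helper: verify that a parallelism layout fits a given world size,
# assuming MoE layers regroup the non-pipeline ranks as ETP x EP x expert-DP.
def check_layout(world_size, tp, pp, cp, ep, etp):
    assert world_size % (tp * pp * cp) == 0, "dense parallelism must divide world size"
    dp = world_size // (tp * pp * cp)
    non_pp = tp * cp * dp
    assert non_pp % (etp * ep) == 0, "ETP * EP must divide TP * CP * DP"
    expert_dp = non_pp // (etp * ep)
    return dp, expert_dp

# The layout from the diff above (ETP != TP, which currently hits the Mcore bug):
print(check_layout(world_size=8, tp=8, pp=1, cp=1, ep=8, etp=1))  # -> (1, 1)

# An interim layout with ETP == TP, following the workaround stated here:
print(check_layout(world_size=8, tp=2, pp=1, cp=1, ep=4, etp=2))  # -> (4, 1)
```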
I want to set a MoE-specific TP size in my custom recipe. Megatron-Core provides this feature through `expert-tensor-parallel-size`, but I don't see any argument in NeMo's code to pass `expert-tensor-parallel-size`. Is there any way to set the MoE-specific TP size directly in a recipe, instead of hacking NeMo's calls to `fake_initialize_model_parallel` and `initialize_model_parallel`? @akoumpa @gdengk