feat: Add HybridEP support for MoE expert parallelism #1942
Conversation
- Update DeepEP dependency to hybrid-ep branch for HybridEP support
  - automodel, vllm, mcore dependency groups updated
- Add HybridEP configuration options in _apply_moe_config():
  - moe_flex_dispatcher_backend: Flex dispatcher backend (e.g., 'hybridep')
  - moe_hybridep_num_sms: Number of SMs for HybridEP operations

Usage in config:

policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32

See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep
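For illustration, here is a minimal sketch (hypothetical names and values, not code from this repo) of how these overrides could be carried in a plain Python dict and forwarded with the same presence checks `_apply_moe_config()` uses, so configs that omit the new keys are left unchanged:

```python
# Hypothetical sketch: the HybridEP keys under policy.megatron_cfg, applied
# only when present so existing configs keep their current behavior.
from types import SimpleNamespace

megatron_cfg = {
    "moe_token_dispatcher_type": "flex",        # HybridEP rides on the flex dispatcher
    "moe_flex_dispatcher_backend": "hybridep",  # select the HybridEP backend
    "moe_hybridep_num_sms": 32,                 # number of SMs for HybridEP operations
}

model_cfg = SimpleNamespace()  # stand-in for the Megatron model config object
if "moe_flex_dispatcher_backend" in megatron_cfg:
    model_cfg.moe_flex_dispatcher_backend = megatron_cfg["moe_flex_dispatcher_backend"]
if "moe_hybridep_num_sms" in megatron_cfg:
    model_cfg.moe_hybridep_num_sms = megatron_cfg["moe_hybridep_num_sms"]

print(vars(model_cfg))
```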
📝 Walkthrough

The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 warnings
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
324-327: ⚠️ Potential issue | 🟠 Major

Stale dependency-metadata version for deep_ep.

The dependency is pinned to the hybrid-ep branch (which dynamically generates its version from the current commit hash via git rev-parse --short HEAD), but the dependency-metadata version is statically set to v1.2.1+bfded34. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures. Either:

- Update to pin to a specific commit hash instead of a branch, or
- Update the metadata version to match the current HEAD of hybrid-ep and regenerate it whenever the dependency updates
🤖 Fix all issues with AI agents
In `@nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
🧹 Nitpick comments (1)
pyproject.toml (1)
70-72: Branch ref instead of pinned commit reduces build reproducibility.

All three dependency groups now point to @hybrid-ep (a moving branch) instead of a fixed commit hash. This means builds are not reproducible: a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the hybrid-ep branch once it stabilizes.
```python
# HybridEP settings for MoE expert parallelism
# See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep
if "moe_flex_dispatcher_backend" in config["megatron_cfg"]:
    model_cfg.moe_flex_dispatcher_backend = config["megatron_cfg"][
        "moe_flex_dispatcher_backend"
    ]
if "moe_hybridep_num_sms" in config["megatron_cfg"]:
    model_cfg.moe_hybridep_num_sms = config["megatron_cfg"]["moe_hybridep_num_sms"]
```
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🏁 Script executed:

```bash
# First, let's understand the file structure and find the TypedDict definition
find . -type f -name "*.py" -path "*/nemo_rl/*" | head -20
```

Length of output: 805

🏁 Script executed:

```bash
# Search for megatron_cfg TypedDict definition and the new keys
rg -n "moe_flex_dispatcher_backend|moe_hybridep_num_sms" --type=py
```

Length of output: 565

🏁 Script executed:

```bash
# Check for exemplar YAMLs in examples/configs
fd -e yaml -p "examples/configs" | head -20
```

Length of output: 856

🏁 Script executed:

```bash
# Search for megatron_cfg TypedDict definition
rg -n "class.*megatron.*cfg|MegatronCfg" --type=py -i | head -30
```

Length of output: 40

🏁 Script executed:

```bash
# Search for TypedDict definitions in megatron module or related config files
rg -n "TypedDict|class.*Cfg" --type=py nemo_rl/models/megatron/ | head -40
```

Length of output: 240

🏁 Script executed:

```bash
# Check the setup.py file around the lines in question for context
sed -n '380,420p' nemo_rl/models/megatron/setup.py
```

Length of output: 1825

🏁 Script executed:

```bash
# Check one of the megatron exemplar YAMLs to see if megatron_cfg is documented
cat examples/configs/grpo_math_70B_megatron.yaml | head -100
```

Length of output: 1826

🏁 Script executed:

```bash
# Search for where megatron_cfg is defined as a TypedDict
rg -n "megatron_cfg.*:" --type=py | head -20
```

Length of output: 1855

🏁 Script executed:

```bash
# Look at the config.py file to find MegatronConfig definition
cat nemo_rl/models/megatron/config.py
```

Length of output: 3082

🏁 Script executed:

```bash
# Also check the policy/__init__.py to see the full PolicyConfig TypedDict
sed -n '1,300p' nemo_rl/models/policy/__init__.py | grep -A 20 "megatron_cfg"
```

Length of output: 828

🏁 Script executed:

```bash
# Search for MegatronConfig class/TypedDict definition
rg -n "class MegatronConfig|MegatronConfig.*=.*TypedDict" --type=py
```

Length of output: 207

🏁 Script executed:

```bash
# Check the entire policy/__init__.py file
wc -l nemo_rl/models/policy/__init__.py
```

Length of output: 95

🏁 Script executed:

```bash
# Get the full PolicyConfig definition including imports
head -300 nemo_rl/models/policy/__init__.py
```

Length of output: 10006
Add moe_flex_dispatcher_backend and moe_hybridep_num_sms to MegatronConfig TypedDict and exemplar YAMLs.
The two new optional keys are correctly handled at runtime with `in` checks, but per coding guidelines, new config keys must be documented in the TypedDict subclass and reflected in exemplar YAMLs.
Add both keys to MegatronConfig in nemo_rl/models/policy/__init__.py using NotRequired, with docstrings explaining their purpose and valid values (e.g., "Backend type for MoE flex dispatcher (HybridEP)" and "Number of SMs for HybridEP"). Update at least one exemplar YAML under examples/configs/ (e.g., a megatron MoE config) to include these keys with their recommended defaults.
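For concreteness, a sketch of what that could look like. The field names come from this PR, but the surrounding MegatronConfig definition in nemo_rl/models/policy/__init__.py is assumed rather than copied, so treat the class body as illustrative only:

```python
# Illustrative only: declaring the two new keys as optional entries of the
# (assumed) MegatronConfig TypedDict. NotRequired comes from typing_extensions
# (or from typing on Python 3.11+).
from typing_extensions import NotRequired, TypedDict


class MegatronConfig(TypedDict):
    # ... existing fields elided ...
    moe_flex_dispatcher_backend: NotRequired[str]
    """Backend type for the MoE flex dispatcher (e.g. 'hybridep')."""

    moe_hybridep_num_sms: NotRequired[int]
    """Number of SMs for HybridEP operations."""
```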
🤖 Prompt for AI Agents
In `@nemo_rl/models/megatron/setup.py` around lines 405 - 412, The new runtime
keys moe_flex_dispatcher_backend and moe_hybridep_num_sms are missing from the
MegatronConfig TypedDict and from example configs; add both to the
MegatronConfig definition in nemo_rl/models/policy/__init__.py as NotRequired
entries (use the exact symbol name MegatronConfig) with short docstrings:
"Backend type for MoE flex dispatcher (HybridEP)" for
moe_flex_dispatcher_backend and "Number of SMs for HybridEP" for
moe_hybridep_num_sms, and then update at least one exemplar YAML in
examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
…odels

Add performance recipes and test scripts for HybridEP and CUDA Graph optimizations:

- grpo-qwen3-30ba3b-4n4g-hybridep: HybridEP with flex dispatcher for Qwen3-30B-A3B
- grpo-qwen3-30ba3b-4n4g-hybridep-cudagraph: HybridEP + CUDA Graph (attn, moe_router)
- grpo-qwen3-235b-16n4g-hybridep: HybridEP for Qwen3-235B-A22B
- grpo-qwen3-235b-16n4g-hybridep-cudagraph: HybridEP + CUDA Graph for Qwen3-235B-A22B

Key configurations (see the sketch below):

- moe_token_dispatcher_type: flex
- moe_flex_dispatcher_backend: hybridep
- moe_hybridep_num_sms: 32
- cuda_graph_impl: transformer_engine
- cuda_graph_scope: [attn, moe_router]
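A rough sketch of how those recipe knobs compose; the key names and values are taken from the list above, while the dict wrapper and the dotted override prefix are assumptions rather than the actual recipe YAML:

```python
# Hypothetical grouping of the HybridEP + CUDA Graph recipe settings; the real
# recipes live under examples/configs and may differ in structure.
hybridep_cudagraph_overrides = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "hybridep",
    "moe_hybridep_num_sms": 32,
    "cuda_graph_impl": "transformer_engine",
    "cuda_graph_scope": ["attn", "moe_router"],
}

# Flatten into dotted override strings; in a real CLI invocation the list value
# would be written as [attn,moe_router].
for key, value in hybridep_cudagraph_overrides.items():
    print(f"policy.megatron_cfg.{key}={value}")
```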
…on handling

- grpo-qwen3-30ba3b-4n4g-hybridep.yaml: sequence_packing.enabled=false (compat note)
- ray.sub: CUDA_HOME/PATH for nvcc, attach shell single-quote fix for uv
- common.env: exit_if_max_steps_reached handles missing metrics.json
- test scripts: metrics.json existence check before jq

Co-authored-by: Cursor <cursoragent@cursor.com>
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480", | ||
| # HybridEP branch for MoE expert parallelism | ||
| # See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep | ||
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", |
There was a problem hiding this comment.
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480 ; platform_machine == 'x86_64'", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep ; platform_machine == 'aarch64'", |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480", | ||
| # HybridEP branch for MoE expert parallelism | ||
| # See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep | ||
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", |
There was a problem hiding this comment.
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480 ; platform_machine == 'x86_64'", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep ; platform_machine == 'aarch64'", |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480", | ||
| # HybridEP branch for MoE expert parallelism | ||
| # See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep | ||
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", |
There was a problem hiding this comment.
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480 ; platform_machine == 'x86_64'", | |
| "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@hybrid-ep ; platform_machine == 'aarch64'", |
Usage in config:
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Dependencies