
feat: add draft model support#1921

Draft
shaunjoshi wants to merge 3 commits into NVIDIA-NeMo:main from shaunjoshi:shaunak/vllm-specdec

Conversation


@shaunjoshi shaunjoshi commented Feb 10, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added configuration for GRPO training with Qwen 2.5 7B model using speculative decoding.
  • Documentation

    • Added comprehensive reproduction guide for building NeMo-RL images with vLLM speculative decoding support.
  • Chores

    • Updated vLLM dependency to v0.12 minimum from strict v0.11.2 pin.
    • Updated build script with new vLLM commit references and flexible wheel location handling.
    • Removed sglang from Docker build due to upstream dependency issues.

@shaunjoshi shaunjoshi requested review from a team as code owners February 10, 2026 12:04
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Feb 10, 2026
@shaunjoshi shaunjoshi marked this pull request as draft February 10, 2026 12:05

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough


This PR adds documentation and configuration for vLLM speculative decoding support in NeMo-RL, updates the vLLM dependency to support newer versions (0.12+), adjusts custom vLLM build defaults, and comments out a broken sglang build step in the Dockerfile.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Docker & Build Configuration**<br>`docker/Dockerfile`, `tools/build-custom-vllm.sh` | Comments out sglang build due to broken sgl-kernel dependency; updates default vLLM build refs and wheel location logic to use newer commit (4a5299c93) with environment variable override support. |
| **Dependencies**<br>`pyproject.toml` | Relaxes vLLM version constraint from exact pin (0.11.2) to minimum version (≥0.12). |
| **GRPO Speculative Decoding Configuration**<br>`examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml` | New YAML configuration for GRPO training with Qwen2.5-7B model using vLLM speculative decoding, including checkpoint, training, generation, logging, and cluster settings. |
| **Reproduction & Build Documentation**<br>`docs/repro-spec-decode-build.md` | Comprehensive guide documenting the reproduction workflow for NeMo-RL v0.5.0 with vLLM speculative decoding, including source components, local modifications, step-by-step instructions, verification results, and known issues. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested labels

documentation

Suggested reviewers

  • yfw
  • terrykong
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'feat: add draft model support' is vague and only partially related to the changeset. While draft models are relevant to the speculative decoding configuration added, the PR encompasses multiple distinct changes including vLLM dependency upgrades, Dockerfile modifications, and comprehensive documentation. | Consider using a more specific title that captures the main scope, such as 'feat: add speculative decoding support with Qwen2.5 configuration and vLLM updates', or clarify if draft model support is indeed the primary change. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | PR includes comprehensive test and verification documentation in the reproduction report confirming successful training on specific hardware (viking-prod-228 with 8x H20 GPUs) with validated checkpoints. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/repro-spec-decode-build.md`:
- Around line 26-28: Fenced code blocks in the commit message section (the
triple-backtick blocks around "946862e7 docs: add release runs to front page
readme for 0.5 (`#1879`)") are missing language identifiers; update each of the
affected fenced blocks (the ones around the plain output examples near the
commit header and the longer output block further down) by appending an
appropriate language tag such as ```log or ```text immediately after the opening
backticks so they satisfy MD040 linting.

In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml`:
- Around line 1-43: The filename and internal identifiers claim "qwen2.5-7b" but
the policy.model_name (Qwen/Qwen3-4B) and speculative draft model
(vllm_kwargs.speculative_config.model = Qwen/Qwen3-0.6B) indicate Qwen3; rename
the recipe and update internal references to be consistent (e.g., change
filename to grpo-qwen3-4b-spec-decode-1n8g.yaml and update checkpoint_dir,
logger.log_dir, wandb.name, and any other strings that contain "qwen2.5-7b" to
"qwen3-4b"), ensuring the name follows the pattern
<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>.

In `@pyproject.toml`:
- Line 79: The dependency declaration "vllm>=0.12" lacks an upper bound; change
it to include a conservative upper limit (e.g. "vllm>=0.12,<0.15") in
pyproject.toml to avoid pulling breaking minor releases, then regenerate your
lockfile (run your project’s lock command such as poetry lock or pip-compile)
and run tests/CI to ensure compatibility; update any dependency metadata that
references the old unconstrained spec if present.

In `@tools/build-custom-vllm.sh`:
- Around line 25-28: Update the stale comment to mention the actual PR/commit
(vLLM PR `#24322` targeting 0.14.x) and remove the misleading "v0.10" / git
merge-base text; then make the fallback wheel URL robust by constructing it from
the VLLM_WHEEL_COMMIT variable instead of embedding a hardcoded version string
and add a guard: if a user supplies $3 (overrides VLLM_WHEEL_COMMIT) but does
not set VLLM_PRECOMPILED_WHEEL_LOCATION, print a clear error and exit (or
require they set VLLM_PRECOMPILED_WHEEL_LOCATION) so the commit and wheel URL
cannot get out of sync; reference VLLM_WHEEL_COMMIT and
VLLM_PRECOMPILED_WHEEL_LOCATION in the change.
🧹 Nitpick comments (2)
docker/Dockerfile (1)

117-118: Consider adding a TODO/issue reference for re-enabling the sglang sync.

Commenting out the sglang step is a reasonable workaround, but without a tracking issue it risks being forgotten indefinitely. The CMake install block at lines 48-58 (labeled "for sglang build") also becomes dead weight in the image while this is disabled.
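
A minimal sketch of such an annotation (the tracking-issue reference is a placeholder and the commented-out sglang command itself is not reproduced here):

```dockerfile
# TODO(<tracking-issue>): re-enable the sglang sync once the upstream sgl-kernel
# dependency is fixed. The CMake install block above (labeled "for sglang build")
# can be dropped or gated behind the same issue.
```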

docs/repro-spec-decode-build.md (1)

59-84: This doc describes patches already applied in this PR — clarify the intended audience.

The "Local Modifications" section documents sed commands to patch build-custom-vllm.sh and the Dockerfile, but these changes are already committed in this PR. This could confuse readers who check out this branch and then try to apply the patches again. Consider adding a note that these modifications are already included in the branch/release, and the sed commands are only needed when starting from the upstream v0.5.0 tag.

Comment on lines +26 to +28
```
946862e7 docs: add release runs to front page readme for 0.5 (#1879)
```

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

Lines 26, 42, and 245 have fenced code blocks without language specifiers. Use text or log for plain output blocks to satisfy linting (MD040).

Example fix for line 245
````diff
-```
+```log
 Initializing a V1 LLM engine (v0.14.0rc2.dev156+g4a5299c93.d20260210)
````

Also applies to: 42-44, 245-251

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 26-26: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines 1 to 43
```yaml
defaults: ../../grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen2.5-7b-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-4B
  tokenizer:
    name: Qwen/Qwen3-4B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.85
      max_model_len: 2048
      vllm_kwargs:
        speculative_config:
          method: draft_model
          model: Qwen/Qwen3-0.6B
          num_speculative_tokens: 5
          draft_tensor_parallel_size: 1
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen2.5-7b-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-qwen2.5-7b-spec-decode-1n8g
cluster:
  gpus_per_node: 8
```

⚠️ Potential issue | 🟠 Major

Filename and config contents are inconsistent — the file says "qwen2.5-7b" but uses Qwen3-4B.

The filename grpo-qwen2.5-7b-spec-decode-1n8g.yaml and internal paths (checkpoint_dir, log_dir, wandb name) all reference "qwen2.5-7b", but the actual model is Qwen/Qwen3-4B (line 9) with draft model Qwen/Qwen3-0.6B (line 30). The repro doc even explains why Qwen3 was chosen over Qwen2.5 (vocab size mismatch).

Rename the file and update internal references to match the actual model, e.g. grpo-qwen3-4b-spec-decode-1n8g.yaml. As per coding guidelines, LLM recipes should follow the naming pattern <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers].
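
A sketch of how the rename could be applied consistently and then smoke-tested; the `uv run` launch command assumes the standard NeMo-RL GRPO math entry point, which is not part of this PR excerpt:

```bash
# Rename the recipe and rewrite every internal "qwen2.5-7b" reference to "qwen3-4b".
git mv examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml \
       examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml
sed -i 's/qwen2\.5-7b/qwen3-4b/g' \
    examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml

# Hypothetical smoke test of the renamed recipe (entry point assumed, not shown in this PR).
uv run python examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml
```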


```diff
 # sudo apt-get install libibverbs-dev
 "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480",
-"vllm==0.11.2",
+"vllm>=0.12",
```

⚠️ Potential issue | 🟡 Minor

Unbounded upper version for vllm is risky.

vllm>=0.12 has no upper bound. vLLM frequently introduces breaking API changes between minor releases. While uv.lock pins the resolved version, anyone syncing without the lockfile (or when the lock is refreshed) could pick up an incompatible version. Consider adding an upper bound, e.g. vllm>=0.12,<0.15.
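
A minimal sketch of applying that bound and refreshing the lock, assuming `uv` is the lock tool (implied by the `uv.lock` reference above); the exact upper bound is a judgment call:

```bash
# Tighten the vLLM constraint in pyproject.toml, then regenerate and sync the lockfile.
sed -i 's/"vllm>=0.12"/"vllm>=0.12,<0.15"/' pyproject.toml
uv lock      # re-resolve dependencies under the new constraint
uv sync      # install the resolved set locally before running tests/CI
```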


Comment on lines 25 to +28
```diff
 # NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
 # This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```

⚠️ Potential issue | 🟡 Minor

Stale comment and fragile wheel URL default.

Two issues:

  1. Stale comment (line 26): The comment still references "close to the v0.10 release" and git merge-base, but the commit is now from vLLM PR #24322 targeting 0.14.x. This is misleading.

  2. Fragile fallback URL (line 28): The default VLLM_PRECOMPILED_WHEEL_LOCATION embeds a hardcoded version string (vllm-0.14.0rc2.dev156%2Bg4a5299c93) that's specific to the default commit. If a user overrides VLLM_WHEEL_COMMIT via $3 without also setting VLLM_PRECOMPILED_WHEEL_LOCATION, the fallback URL will point to the wrong wheel. Consider either documenting this coupling more prominently or erroring out if $3 is provided without an explicit wheel location.
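
A hedged sketch of such a guard for tools/build-custom-vllm.sh, reusing the variable names from the excerpt above (the error wording and structure are illustrative, not the PR's code):

```bash
# If the caller overrides the wheel commit via $3, require an explicit wheel URL so
# the commit and the precompiled wheel can never silently disagree.
if [[ -n "${3:-}" && -z "${VLLM_PRECOMPILED_WHEEL_LOCATION:-}" ]]; then
    echo "ERROR: \$3 overrides VLLM_WHEEL_COMMIT but VLLM_PRECOMPILED_WHEEL_LOCATION is unset." >&2
    echo "       Export VLLM_PRECOMPILED_WHEEL_LOCATION pointing at the wheel built from commit ${3}." >&2
    exit 1
fi

VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
# The fallback URL below only matches the default commit; overrides must supply their own URL (enforced above).
export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```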

Suggested comment fix
```diff
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# WARNING: The default wheel URL below is specific to the default VLLM_WHEEL_COMMIT. If overriding $3, also set VLLM_PRECOMPILED_WHEEL_LOCATION.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
-VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
-export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# WARNING: The default wheel URL below is specific to the default VLLM_WHEEL_COMMIT. If overriding $3, also set VLLM_PRECOMPILED_WHEEL_LOCATION.
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```

@chtruong814 chtruong814 added the needs-follow-up (Issue needs follow-up) label Feb 12, 2026

Labels

community-request · documentation (Improvements or additions to documentation) · needs-follow-up (Issue needs follow-up)

Projects

None yet


2 participants