
feat: add draft model support#1921

Draft
shaunjoshi wants to merge 3 commits into NVIDIA-NeMo:main from shaunjoshi:shaunak/vllm-specdec

Conversation


@shaunjoshi shaunjoshi commented Feb 10, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added configuration for GRPO training with Qwen 2.5 7B model using speculative decoding.
  • Documentation

    • Added comprehensive reproduction guide for building NeMo-RL images with vLLM speculative decoding support.
  • Chores

    • Updated vLLM dependency to v0.12 minimum from strict v0.11.2 pin.
    • Updated build script with new vLLM commit references and flexible wheel location handling.
    • Removed sglang from Docker build due to upstream dependency issues.

@shaunjoshi shaunjoshi requested review from a team as code owners February 10, 2026 12:04
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Feb 10, 2026
@shaunjoshi shaunjoshi marked this pull request as draft February 10, 2026 12:05

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough


This PR adds documentation and configuration for vLLM speculative decoding support in NeMo-RL, updates the vLLM dependency to support newer versions (0.12+), adjusts custom vLLM build defaults, and comments out a broken sglang build step in the Dockerfile.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Docker & Build Configuration**<br>`docker/Dockerfile`, `tools/build-custom-vllm.sh` | Comments out sglang build due to broken sgl-kernel dependency; updates default vLLM build refs and wheel location logic to use newer commit (4a5299c93) with environment variable override support. |
| **Dependencies**<br>`pyproject.toml` | Relaxes vLLM version constraint from exact pin (0.11.2) to minimum version (≥0.12). |
| **GRPO Speculative Decoding Configuration**<br>`examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml` | New YAML configuration for GRPO training with Qwen2.5-7B model using vLLM speculative decoding, including checkpoint, training, generation, logging, and cluster settings. |
| **Reproduction & Build Documentation**<br>`docs/repro-spec-decode-build.md` | Comprehensive guide documenting the reproduction workflow for NeMo-RL v0.5.0 with vLLM speculative decoding, including source components, local modifications, step-by-step instructions, verification results, and known issues. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested labels

documentation

Suggested reviewers

  • yfw
  • terrykong
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'feat: add draft model support' is vague and only partially related to the changeset. While draft models are relevant to the speculative decoding configuration added, the PR encompasses multiple distinct changes including vLLM dependency upgrades, Dockerfile modifications, and comprehensive documentation. | Consider using a more specific title that captures the main scope, such as 'feat: add speculative decoding support with Qwen2.5 configuration and vLLM updates', or clarify if draft model support is indeed the primary change. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Test Results For Major Changes | ✅ Passed | PR includes comprehensive test and verification documentation in the reproduction report confirming successful training on specific hardware (viking-prod-228 with 8x H20 GPUs) with validated checkpoints. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/repro-spec-decode-build.md`:
- Around line 26-28: Fenced code blocks in the commit message section (the
triple-backtick blocks around "946862e7 docs: add release runs to front page
readme for 0.5 (`#1879`)") are missing language identifiers; update each of the
affected fenced blocks (the ones around the plain output examples near the
commit header and the longer output block further down) by appending an
appropriate language tag such as ```log or ```text immediately after the opening
backticks so they satisfy MD040 linting.

In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml`:
- Around line 1-43: The filename and internal identifiers claim "qwen2.5-7b" but
the policy.model_name (Qwen/Qwen3-4B) and speculative draft model
(vllm_kwargs.speculative_config.model = Qwen/Qwen3-0.6B) indicate Qwen3; rename
the recipe and update internal references to be consistent (e.g., change
filename to grpo-qwen3-4b-spec-decode-1n8g.yaml and update checkpoint_dir,
logger.log_dir, wandb.name, and any other strings that contain "qwen2.5-7b" to
"qwen3-4b"), ensuring the name follows the pattern
<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>.

In `@pyproject.toml`:
- Line 79: The dependency declaration "vllm>=0.12" lacks an upper bound; change
it to include a conservative upper limit (e.g. "vllm>=0.12,<0.15") in
pyproject.toml to avoid pulling breaking minor releases, then regenerate your
lockfile (run your project’s lock command such as poetry lock or pip-compile)
and run tests/CI to ensure compatibility; update any dependency metadata that
references the old unconstrained spec if present.

In `@tools/build-custom-vllm.sh`:
- Around line 25-28: Update the stale comment to mention the actual PR/commit
(vLLM PR `#24322` targeting 0.14.x) and remove the misleading "v0.10" / git
merge-base text; then make the fallback wheel URL robust by constructing it from
the VLLM_WHEEL_COMMIT variable instead of embedding a hardcoded version string
and add a guard: if a user supplies $3 (overrides VLLM_WHEEL_COMMIT) but does
not set VLLM_PRECOMPILED_WHEEL_LOCATION, print a clear error and exit (or
require they set VLLM_PRECOMPILED_WHEEL_LOCATION) so the commit and wheel URL
cannot get out of sync; reference VLLM_WHEEL_COMMIT and
VLLM_PRECOMPILED_WHEEL_LOCATION in the change.
🧹 Nitpick comments (2)
docker/Dockerfile (1)

117-118: Consider adding a TODO/issue reference for re-enabling the sglang sync.

Commenting out the sglang step is a reasonable workaround, but without a tracking issue it risks being forgotten indefinitely. The CMake install block at lines 48-58 (labeled "for sglang build") also becomes dead weight in the image while this is disabled.
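
A minimal sketch of such an annotation (the tracking-issue reference is a placeholder and the commented-out sglang command itself is not reproduced here):

```dockerfile
# TODO(<tracking-issue>): re-enable the sglang sync once the upstream sgl-kernel
# dependency is fixed. The CMake install block above (labeled "for sglang build")
# can be dropped or gated behind the same issue.
```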

docs/repro-spec-decode-build.md (1)

59-84: This doc describes patches already applied in this PR — clarify the intended audience.

The "Local Modifications" section documents sed commands to patch build-custom-vllm.sh and the Dockerfile, but these changes are already committed in this PR. This could confuse readers who check out this branch and then try to apply the patches again. Consider adding a note that these modifications are already included in the branch/release, and the sed commands are only needed when starting from the upstream v0.5.0 tag.

Comment on lines +26 to +28
```
946862e7 docs: add release runs to front page readme for 0.5 (#1879)
```

⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced code blocks.

Lines 26, 42, and 245 have fenced code blocks without language specifiers. Use text or log for plain output blocks to satisfy linting (MD040).

Example fix for line 245
````diff
-```
+```log
 Initializing a V1 LLM engine (v0.14.0rc2.dev156+g4a5299c93.d20260210)
````

Also applies to: 42-44, 245-251

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 26-26: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines 1 to 43
```yaml
defaults: ../../grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen2.5-7b-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-4B
  tokenizer:
    name: Qwen/Qwen3-4B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.85
      max_model_len: 2048
      vllm_kwargs:
        speculative_config:
          method: draft_model
          model: Qwen/Qwen3-0.6B
          num_speculative_tokens: 5
          draft_tensor_parallel_size: 1
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen2.5-7b-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-qwen2.5-7b-spec-decode-1n8g
cluster:
  gpus_per_node: 8
```

⚠️ Potential issue | 🟠 Major

Filename and config contents are inconsistent — the file says "qwen2.5-7b" but uses Qwen3-4B.

The filename grpo-qwen2.5-7b-spec-decode-1n8g.yaml and internal paths (checkpoint_dir, log_dir, wandb name) all reference "qwen2.5-7b", but the actual model is Qwen/Qwen3-4B (line 9) with draft model Qwen/Qwen3-0.6B (line 30). The repro doc even explains why Qwen3 was chosen over Qwen2.5 (vocab size mismatch).

Rename the file and update internal references to match the actual model, e.g. grpo-qwen3-4b-spec-decode-1n8g.yaml. As per coding guidelines, LLM recipes should follow the naming pattern <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers].
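
A sketch of how the rename could be applied consistently and then smoke-tested; the `uv run` launch command assumes the standard NeMo-RL GRPO math entry point, which is not part of this PR excerpt:

```bash
# Rename the recipe and rewrite every internal "qwen2.5-7b" reference to "qwen3-4b".
git mv examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml \
       examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml
sed -i 's/qwen2\.5-7b/qwen3-4b/g' \
    examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml

# Hypothetical smoke test of the renamed recipe (entry point assumed, not shown in this PR).
uv run python examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-qwen3-4b-spec-decode-1n8g.yaml
```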


```diff
 # sudo apt-get install libibverbs-dev
 "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480",
-"vllm==0.11.2",
+"vllm>=0.12",
```

⚠️ Potential issue | 🟡 Minor

Unbounded upper version for vllm is risky.

vllm>=0.12 has no upper bound. vLLM frequently introduces breaking API changes between minor releases. While uv.lock pins the resolved version, anyone syncing without the lockfile (or when the lock is refreshed) could pick up an incompatible version. Consider adding an upper bound, e.g. vllm>=0.12,<0.15.
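
A minimal sketch of applying that bound and refreshing the lock, assuming `uv` is the lock tool (implied by the `uv.lock` reference above); the exact upper bound is a judgment call:

```bash
# Tighten the vLLM constraint in pyproject.toml, then regenerate and sync the lockfile.
sed -i 's/"vllm>=0.12"/"vllm>=0.12,<0.15"/' pyproject.toml
uv lock      # re-resolve dependencies under the new constraint
uv sync      # install the resolved set locally before running tests/CI
```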


Comment on lines 25 to +28
```diff
 # NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
 # This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```

⚠️ Potential issue | 🟡 Minor

Stale comment and fragile wheel URL default.

Two issues:

  1. Stale comment (line 26): The comment still references "close to the v0.10 release" and git merge-base, but the commit is now from vLLM PR #24322 targeting 0.14.x. This is misleading.

  2. Fragile fallback URL (line 28): The default VLLM_PRECOMPILED_WHEEL_LOCATION embeds a hardcoded version string (vllm-0.14.0rc2.dev156%2Bg4a5299c93) that's specific to the default commit. If a user overrides VLLM_WHEEL_COMMIT via $3 without also setting VLLM_PRECOMPILED_WHEEL_LOCATION, the fallback URL will point to the wrong wheel. Consider either documenting this coupling more prominently or erroring out if $3 is provided without an explicit wheel location.
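
A hedged sketch of such a guard for tools/build-custom-vllm.sh, reusing the variable names from the excerpt above (the error wording and structure are illustrative, not the PR's code):

```bash
# If the caller overrides the wheel commit via $3, require an explicit wheel URL so
# the commit and the precompiled wheel can never silently disagree.
if [[ -n "${3:-}" && -z "${VLLM_PRECOMPILED_WHEEL_LOCATION:-}" ]]; then
    echo "ERROR: \$3 overrides VLLM_WHEEL_COMMIT but VLLM_PRECOMPILED_WHEEL_LOCATION is unset." >&2
    echo "       Export VLLM_PRECOMPILED_WHEEL_LOCATION pointing at the wheel built from commit ${3}." >&2
    exit 1
fi

VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
# The fallback URL below only matches the default commit; overrides must supply their own URL (enforced above).
export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```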

Suggested comment fix
```diff
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# WARNING: The default wheel URL below is specific to the default VLLM_WHEEL_COMMIT. If overriding $3, also set VLLM_PRECOMPILED_WHEEL_LOCATION.
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
-VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
-export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# WARNING: The default wheel URL below is specific to the default VLLM_WHEEL_COMMIT. If overriding $3, also set VLLM_PRECOMPILED_WHEEL_LOCATION.
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```

@chtruong814 chtruong814 added the needs-follow-up (Issue needs follow-up) label Feb 12, 2026

Labels

community-request · documentation (Improvements or additions to documentation) · needs-follow-up (Issue needs follow-up)

Projects

None yet


2 participants