
Shaunak/vllm specdec metrics #1941

Open
shaunjoshi wants to merge 4 commits into NVIDIA-NeMo:main from shaunjoshi:shaunak/vllm-specdec-metrics

Conversation


@shaunjoshi shaunjoshi commented Feb 13, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

Release Notes

  • New Features

    • Added training configurations for Qwen2.5-7B and Qwen3-14B/32B models with speculative-decoding variants and standard decoding options
    • Enhanced metrics collection with token acceptance rates for improved speculative decoding visibility
  • Documentation

    • Added comprehensive guide for building custom Docker images with vLLM speculative decoding support
  • Dependencies

    • Updated vLLM from 0.11.2 to 0.15.1+ with Python <3.13 requirement

@shaunjoshi shaunjoshi requested review from a team as code owners February 13, 2026 06:14
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label on Feb 13, 2026

coderabbitai bot commented Feb 13, 2026

📝 Walkthrough

This PR adds speculative decoding metrics collection and logging to the GRPO training pipeline, updates vLLM-related build dependencies and configurations, adds nine GRPO configuration files for various Qwen model setups with speculative decoding, and includes reproduction documentation and build tooling updates.

Changes

  • Docker and Build Configuration (docker/Dockerfile, tools/build-custom-vllm.sh, pyproject.toml): Updated the vLLM dependency from 0.11.2 to >=0.15.1,<0.16; commented out the runtime sglang backend sync step in the Dockerfile; refined the TOML mutation logic for vLLM constraint handling; added a Python version constraint (<3.13) and expanded the no-build-isolation-package list with additional dependencies.
  • Documentation and Reproduction Guide (docs/repro-spec-decode-build.md): Added a reproduction report detailing how to build a custom NeMo-RL Docker image with vLLM speculative decoding, including source provenance, patches, build commands, verification results, and known issues around sglang and async scheduling.
  • GRPO Configuration Files (examples/configs/recipes/llm/grpo-qwen*.yaml): Added nine YAML configurations for GRPO experiments with Qwen2.5-7B and Qwen3-14B/32B models, with and without speculative decoding, covering vLLM setup, speculative draft-model parameters, batching strategies, logging, and cluster resource allocation.
  • Speculative Decoding Metrics Collection (nemo_rl/algorithms/utils.py, nemo_rl/algorithms/grpo.py, nemo_rl/models/generation/vllm/vllm_generation.py, nemo_rl/models/generation/vllm/vllm_worker_async.py): Added utility functions to compute token-acceptance metrics from spec-decode counters; integrated metric computation and logging into GRPO training; extended vLLM metrics collection to capture accepted/proposed token counts, counter names, and acceptance rate; improved compilation-config handling in the vLLM worker. A sketch of the acceptance-rate computation follows this list.
  • Tests (tests/unit/algorithms/test_utils.py): Added unit tests for the compute_spec_decode_token_acceptance_metrics utility, covering normal metric computation and missing-counter edge cases.
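The shape of the acceptance-rate computation can be inferred from the new unit tests quoted further down this page. The helper names below are illustrative stand-ins, not the PR's actual implementation in nemo_rl/algorithms/utils.py:

```python
from typing import Any


def _delta(series: list[Any]) -> float:
    """Step-local increment of a cumulative counter timeline (last minus first sample)."""
    values = []
    for v in series:
        try:
            values.append(float(v))
        except (TypeError, ValueError):
            continue
    if len(values) < 2:
        return 0.0
    delta = values[-1] - values[0]
    # If the counter reset mid-step, fall back to the latest value.
    return delta if delta >= 0 else max(0.0, values[-1])


def acceptance_metrics(logger_metrics: dict[str, Any]) -> dict[str, float]:
    """Aggregate per-worker accepted/proposed counter timelines into step metrics."""
    accepted_series = logger_metrics.get("spec_decode_accepted_tokens", {})
    proposed_series = logger_metrics.get("spec_decode_proposed_tokens", {})
    accepted = sum(_delta(s) for s in accepted_series.values())
    proposed = sum(_delta(s) for s in proposed_series.values())
    metrics = {
        "accepted_draft_tokens": accepted,
        "proposed_draft_tokens": proposed,
        "spec_decode_accepted_counter_found": float(bool(accepted_series)),
        "spec_decode_proposed_counter_found": float(bool(proposed_series)),
    }
    # When the accepted counter is missing, omit the rate so callers can report N/A.
    if accepted_series and proposed > 0:
        metrics["token_acceptance_rate"] = accepted / proposed
    return metrics
```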

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #1785: Modifies the same speculative-decoding metric paths, adding/consuming spec-decode counters and instrumenting GRPO/vLLM code for counter collection and logging.
  • PR #1334: Overlaps in vLLM/UV/Docker/build-file and metrics areas, extending related build and vLLM speculative-decoding functionality.
  • PR #1491: Modifies Dockerfile venv/build steps; related at the Docker/build-step level with complementary container fingerprinting and build logic.

Suggested labels

documentation, CI:L1

Suggested reviewers

  • yfw
  • gshennvm
  • terrykong
🚥 Pre-merge checks: ✅ 1 passed | ❌ 4 failed
❌ Failed checks (3 warnings, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.91%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
  • Merge Conflict Detection (⚠️ Warning): Merge conflicts detected (84 files):

⚔️ .github/workflows/release-docs.yml (content)
⚔️ README.md (content)
⚔️ docker/Dockerfile (content)
⚔️ docs/about/performance-summary.md (content)
⚔️ docs/conf.py (content)
⚔️ docs/design-docs/dependency-management.md (content)
⚔️ docs/docker.md (content)
⚔️ docs/guides/dpo.md (content)
⚔️ docs/guides/grpo.md (content)
⚔️ docs/guides/rm.md (content)
⚔️ docs/guides/sft.md (content)
⚔️ docs/index.md (content)
⚔️ docs/versions1.json (content)
⚔️ examples/configs/distillation_math.yaml (content)
⚔️ examples/configs/distillation_math_megatron.yaml (content)
⚔️ examples/configs/dpo.yaml (content)
⚔️ examples/configs/grpo_math_1B.yaml (content)
⚔️ examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-megatron-fp8-e2e.yaml (content)
⚔️ examples/configs/recipes/llm/grpo-moonlight-16ba3b-4n8g-megatron-fp8-e2e.yaml (content)
⚔️ examples/configs/recipes/llm/performance/grpo-llama3.1-8b-instruct-2n8g-fp8-async-1off.yaml (content)
⚔️ examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml (content)
⚔️ examples/configs/rm.yaml (content)
⚔️ examples/configs/sft.yaml (content)
⚔️ examples/configs/sft_openmathinstruct2_megatron.yaml (content)
⚔️ examples/configs/vlm_grpo_3B.yaml (content)
⚔️ examples/configs/vlm_grpo_3B_megatron.yaml (content)
⚔️ examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (content)
⚔️ examples/nemo_gym/run_grpo_nemo_gym.py (content)
⚔️ examples/run_distillation.py (content)
⚔️ examples/run_grpo.py (content)
⚔️ examples/run_grpo_sliding_puzzle.py (content)
⚔️ examples/run_sft.py (content)
⚔️ examples/run_vlm_grpo.py (content)
⚔️ nemo_rl/algorithms/grpo.py (content)
⚔️ nemo_rl/algorithms/loss_functions.py (content)
⚔️ nemo_rl/algorithms/reward_functions.py (content)
⚔️ nemo_rl/algorithms/utils.py (content)
⚔️ nemo_rl/data/__init__.py (content)
⚔️ nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py (content)
⚔️ nemo_rl/data/datasets/preference_datasets/preference_dataset.py (content)
⚔️ nemo_rl/data/datasets/response_datasets/__init__.py (content)
⚔️ nemo_rl/data/datasets/response_datasets/response_dataset.py (content)
⚔️ nemo_rl/data/datasets/utils.py (content)
⚔️ nemo_rl/data/processors.py (content)
⚔️ nemo_rl/data/utils.py (content)
⚔️ nemo_rl/environments/nemo_gym.py (content)
⚔️ nemo_rl/environments/utils.py (content)
⚔️ nemo_rl/experience/rollouts.py (content)
⚔️ nemo_rl/models/automodel/data.py (content)
⚔️ nemo_rl/models/generation/__init__.py (content)
⚔️ nemo_rl/models/generation/interfaces.py (content)
⚔️ nemo_rl/models/generation/vllm/utils.py (content)
⚔️ nemo_rl/models/generation/vllm/vllm_generation.py (content)
⚔️ nemo_rl/models/generation/vllm/vllm_worker.py (content)
⚔️ nemo_rl/models/generation/vllm/vllm_worker_async.py (content)
⚔️ nemo_rl/models/megatron/data.py (content)
⚔️ nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (content)
⚔️ nemo_rl/models/policy/workers/megatron_policy_worker.py (content)
⚔️ nemo_rl/utils/config.py (content)
⚔️ nemo_rl/utils/logger.py (content)
⚔️ nemo_rl/utils/native_checkpoint.py (content)
⚔️ nemo_rl/utils/prefetch_venvs.py (content)
⚔️ pyproject.toml (content)
⚔️ pyrefly.toml (content)
⚔️ research/template_project/configs/grpo_math_1B.yaml (content)
⚔️ research/template_project/single_update.py (content)
⚔️ tests/functional/L1_Functional_Tests_GPU.sh (content)
⚔️ tests/functional/dpo_megatron.sh (content)
⚔️ tests/test_suites/nightly.txt (content)
⚔️ tests/unit/algorithms/test_grpo.py (content)
⚔️ tests/unit/algorithms/test_loss_functions.py (content)
⚔️ tests/unit/algorithms/test_reward_functions.py (content)
⚔️ tests/unit/algorithms/test_utils.py (content)
⚔️ tests/unit/data/datasets/test_response_dataset.py (content)
⚔️ tests/unit/experience/test_rollouts.py (content)
⚔️ tests/unit/models/generation/test_vllm_utils.py (content)
⚔️ tests/unit/models/megatron/test_megatron_data.py (content)
⚔️ tests/unit/test_config_validation.py (content)
⚔️ tests/unit/test_recipes_and_test_suites.py (content)
⚔️ tests/unit/utils/test_logger.py (content)
⚔️ tests/unit/utils/test_native_checkpoint.py (content)
⚔️ tools/build-custom-vllm.sh (content)
⚔️ tools/config_cli.py (content)
⚔️ uv.lock (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
  • Test Results For Major Changes (⚠️ Warning): The PR description is incomplete; it lacks test results, a performance impact assessment, convergence validation, and vLLM upgrade regression-testing documentation. Document unit/functional test results, performance comparisons between vLLM versions, convergence validation, and regression-testing results, and mark the checklist items complete.
  • Title check (❓ Inconclusive): The title "Shaunak/vllm specdec metrics" is vague; an author-name prefix plus abbreviations conveys little to a teammate scanning history. Revise it to describe the main change, e.g., "Add spec-decode token acceptance metrics logging" or "Integrate vLLM speculative decode metrics and training configs".
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@docs/repro-spec-decode-build.md`:
- Around line 1-6: The doc exposes internal infra details—replace the specific
hostnames and paths mentioned in the header and body (e.g., "viking-prod-228",
"computelab-sc-01", and "/home/scratch.shaunakj_other/") with generalized
placeholders like "<host-name>" or "<your-scratch-dir>" across the "Reproduction
Report: NeMo-RL v0.5.0 + vLLM Speculative Decoding with Draft Models" document;
search for those exact strings and update them to non-sensitive placeholders
and, if needed, add a short note that readers should substitute their own
environment values.

In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml`:
- Around line 1-46: The config filename and several path/name fields reference
qwen2.5-7b but the policy.model_name and policy.tokenizer.name are set to
Qwen/Qwen3-4B; make them consistent by either (A) updating policy.model_name and
policy.tokenizer.name to the intended Qwen2.5-7b model identifier (and adjust
any vllm_kwargs/speculative model names accordingly), or (B) rename the filename
and all related identifiers (checkpoint_dir, logger.log_dir, logger.wandb.name,
and any other qwen2.5-7b strings) to reflect Qwen3-4B; ensure you also update
the speculative_config.model if using speculative decoding so it matches the
chosen base model.

In `@examples/configs/recipes/llm/grpo-qwen3-14b-no-spec-decode-1n8g.yaml`:
- Around line 1-34: This config is missing several keys present in the lowbatch
variant; update the YAML to explicitly add checkpointing.enabled (e.g., false if
you want the same behavior as lowbatch) and the generation keys under policy:
policy.generation.temperature, policy.generation.vllm_cfg.async_engine, and
policy.generation.vllm_cfg.enable_vllm_metrics_logger (and
vllm_metrics_logger_interval) to match the lowbatch settings or to the intended
values; locate these settings by the symbols checkpointing.enabled,
policy.generation.temperature, policy.generation.vllm_cfg.async_engine,
policy.generation.vllm_cfg.enable_vllm_metrics_logger, and
vllm_metrics_logger_interval and add them with the desired values so behavior is
explicit rather than relying on implicit defaults.

In
`@examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml`:
- Line 1: The recipe filename violates the LLM naming convention by placing the
node/GPU segment at the end; rename the file from
grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml to follow the pattern used by
other LLM recipes (e.g., grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml) so
the node/GPU segment appears immediately after the model, ensuring consistency
with the naming rule
"<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)"
and update any references to this filename in tooling or docs (search for
occurrences of grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml and replace).

In `@examples/configs/recipes/llm/grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml`:
- Line 26: The config sets tensor_parallel_size: 1 for a 32B model which is
likely to OOM on 80GB GPUs when gpu_memory_utilization is 0.85; update the YAML
to either document a minimum GPU memory requirement (mentioning 96GB/H20 or
similar) or increase tensor_parallel_size (e.g., to 2 or more) so model weights
fit alongside KV cache and draft model—adjust the comments and any README
references for this example and ensure the fields tensor_parallel_size and
gpu_memory_utilization in the grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml
reflect the new recommended values or include the minimum-GPU note.

In `@nemo_rl/algorithms/utils.py`:
- Around line 386-415: Both helper functions lack Google-style docstring
sections: add an "Args" section to _counter_delta_over_step(counter_series)
describing counter_series: list[Any] and its expected content/type, and add both
"Args" and "Returns" to
compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
describing generation_logger_metrics: dict[str, Any] and the returned dict[str,
float]; ensure the docstrings follow the repo's Google style (param
descriptions, return description, types) and keep existing summary lines intact
for _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics.

In `@pyproject.toml`:
- Line 79: Dependency change to "vllm>=0.15.1,<0.16" may mismatch the custom
build v0.14.0rc2; audit all imports from the v1 API (e.g.,
vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*) to ensure the
specific functions/classes we use exist and behave identically in both v0.14.x
and v0.15.x, run unit/integration tests against both versions, and if any
incompatibility is found either pin pyproject.toml to a compatible range or add
conditional imports/shims (or adjust tools/build-custom-vllm.sh to produce a
matching v0.15.x dev build) so both PyPI installs and custom builds work
consistently.

In `@tests/unit/algorithms/test_utils.py`:
- Around line 599-626: Add Google-style docstrings to the two test functions
test_compute_spec_decode_token_acceptance_metrics_normal_case and
test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter: each
docstring should be a short one-line summary of what the test verifies, followed
by an "Args:" section (None) and a "Returns:" section (None) to match existing
module style so Sphinx can parse them; place the docstring immediately under the
def line for each test and keep wording concise and consistent with other tests
in the file.

In `@tools/build-custom-vllm.sh`:
- Around line 24-28: Update the stale note and add an explicit warning about the
version mismatch: replace or remove the old "close to the v0.10 release" comment
near GIT_REF/VLLM_WHEEL_COMMIT and instead add a short comment next to
VLLM_WHEEL_COMMIT (and/or VLLM_PRECOMPILED_WHEEL_LOCATION) stating this
commit/build produces a vllm wheel tagged 0.14.0rc2.dev156 (from PR `#24322`)
which is intentionally outside the normal pyproject.toml constraint
(pyproject.toml requires vllm>=0.15.1,<0.16) and that the script unpins vllm
before install when using the custom wheel; reference the variables GIT_REF,
VLLM_WHEEL_COMMIT and VLLM_PRECOMPILED_WHEEL_LOCATION to locate where to update
the comments.
🧹 Nitpick comments (10)
examples/configs/recipes/llm/grpo-qwen3-14b-draft0.6b-spec-decode-tuned-1n8g.yaml (1)

1-44: Recipe filename doesn't follow the prescribed naming pattern.

The coding guideline requires LLM recipe names to follow: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers]. In the current filename, 1n8g appears at the end rather than immediately after the model identifier. Consider renaming to something like:

grpo-qwen3-14b-1n8g-specdec-draft0.6b-tuned.yaml

This places the node/GPU segment (1n8g) right after the model, with the strategy and modifiers following it.

As per coding guidelines: "For LLM recipes, follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)"

examples/configs/recipes/llm/grpo-qwen3-14b-spec-decode-1n8g.yaml (1)

31-35: draft_tensor_parallel_size: 2 may be unnecessary for a 1.7B draft model.

A 1.7B model typically fits comfortably on a single GPU. Using draft_tensor_parallel_size: 1 would free GPU resources for the main model and reduce communication overhead. Was TP=2 intentional here?

Also note: the AI summary incorrectly states the draft model is Qwen/Qwen3-0.6B, but the config specifies Qwen/Qwen3-1.7B.

docker/Dockerfile (1)

117-118: Consider adding a tracking issue or TODO for re-enabling the sglang extra.

The commented-out sync is a reasonable workaround for the broken upstream sgl-kernel build. To avoid this becoming a permanent dead comment, consider linking a tracking issue so it gets re-enabled once the upstream DeepGEMM reference is restored.

docs/repro-spec-decode-build.md (1)

26-28: Add language identifiers to fenced code blocks.

A few code blocks (Lines 26, 42, 245) are missing language specifiers, which was flagged by markdownlint. Add appropriate identifiers (e.g., ```text or ```log).

examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml (1)

16-23: Align max_new_tokens with the full context window.

max_total_sequence_length is 5120 while max_new_tokens is 4096, which diverges from the GRPO config convention. Consider matching the full context window for consistent generation limits.

♻️ Suggested change
-    max_new_tokens: 4096
+    max_new_tokens: 5120
Based on learnings: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase.
nemo_rl/models/generation/vllm/vllm_worker_async.py (5)

150-205: Robust CompilationConfig field introspection – looks good overall.

The approach of first checking __pydantic_fields__ and falling back to inspect.signature is a reasonable forward-compatibility measure. One small note: the broad except (TypeError, ValueError): pass at lines 185-186 silently swallows introspection failures with no log message. Consider emitting a warning when this fallback path fails, so operators can diagnose issues when upgrading vLLM versions.

Suggested improvement
                     except (TypeError, ValueError):
-                        pass
+                        print(
+                            "⚠️[vLLM Init] Could not introspect CompilationConfig "
+                            "fields; passing compilation_config dict as-is.",
+                        )

260-337: _is_fallback_accepted_token_counter is overly broad — could match unrelated counters.

The fallback (lines 331–337) matches any counter whose normalized name contains "accept" and "token" (excluding "vllm:generation_tokens"), without requiring the counter to be spec-decode-related. Although the call site (line 382) gates it behind _spec_decode_proposed_counter_name is not None, a non-spec-decode counter (e.g., a hypothetical future vllm:kv_cache_accepted_token_evictions) could be incorrectly selected as the accepted-token counter.

Consider adding _is_spec_decode_related_counter as an additional guard in _is_fallback_accepted_token_counter (or at the call site):

Suggested tightening
         def _is_fallback_accepted_token_counter(name: str) -> bool:
             normalized_name = _normalize_metric_name(name)
             return (
                 "accept" in normalized_name
                 and "token" in normalized_name
                 and normalized_name != "vllm:generation_tokens"
+                and _is_spec_decode_related_counter(normalized_name)
             )

365-391: Counter discovery is order-dependent and latches on the first match.

The accepted/proposed counter names are set by the first matching counter encountered in get_metrics_snapshot() (first-write-wins via the is None guard). If vLLM exposes multiple counters that match the heuristic (e.g., both a total and a per-step variant), the selected counter depends on iteration order — which may vary across vLLM versions.

The preferred-name sets (lines 282–301) partially address this, but _is_accepted_token_counter / _is_proposed_token_counter also accept fuzzy matches that are checked after the preferred names, using the same first-write-wins pattern.

This is acceptable for now given the logging at lines 393-404 that announces the selection, but be aware this could silently pick a wrong counter after a vLLM upgrade. A future improvement could iterate all counters first, then select from the candidates using explicit priority.
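One way to realize the suggested "collect all candidates first, then select by explicit priority" approach is sketched below; the preferred counter name is a placeholder, not necessarily the name vLLM actually exports:

```python
# Placeholder preferred name; substitute whatever counter your vLLM build exposes.
PREFERRED_ACCEPTED_NAMES = ("vllm:spec_decode_num_accepted_tokens",)


def select_accepted_counter(candidates: list[str]) -> str | None:
    """Pick the accepted-token counter deterministically, independent of iteration order."""
    for preferred in PREFERRED_ACCEPTED_NAMES:
        if preferred in candidates:
            return preferred  # exact preferred names always win
    # Fuzzy fallback restricted to spec-decode counters, ranked by name length
    # as a crude specificity proxy instead of first-match-wins.
    fuzzy = sorted(
        (n for n in candidates if "spec" in n and "accept" in n and "token" in n),
        key=len,
        reverse=True,
    )
    return fuzzy[0] if fuzzy else None
```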


388-391: Both if checks run unconditionally — if accepted and proposed counters ever share the same name, the value would be double-counted.

Lines 388–389 and 390–391 are independent if statements (not elif). While current logic ensures the names differ (proposed excludes "accept" and accepted excludes proposed-related keywords), this is a latent hazard if the heuristics are ever relaxed.


460-470: clear_vllm_logger_metrics resets token lists but not counter names or candidates-seen.

This appears intentional (counter names are discovered once and persist), but worth confirming: after a clear, the next get_vllm_logger_metrics call will still return the old spec_decode_accepted_counter_name, spec_decode_proposed_counter_name, and spec_decode_counter_candidates_seen. If the engine is reconfigured (e.g., spec-decode disabled between steps), stale counter names would persist. If this is the intended design, a brief comment would help future maintainers.
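If a full reset is ever wanted, it could look like the sketch below; the container class and timeline attribute names are hypothetical stand-ins for the worker state described above:

```python
class SpecDecodeCounterState:
    """Hypothetical holder for the spec-decode logger state discussed above."""

    def __init__(self) -> None:
        self.accepted_timeline: list[float] = []
        self.proposed_timeline: list[float] = []
        self.accepted_counter_name: str | None = None
        self.proposed_counter_name: str | None = None
        self.counter_candidates_seen: set[str] = set()

    def clear(self, forget_counters: bool = False) -> None:
        """Reset per-step timelines; optionally also forget discovered counter identities."""
        self.accepted_timeline.clear()
        self.proposed_timeline.clear()
        if forget_counters:  # useful if spec-decode can be reconfigured between steps
            self.accepted_counter_name = None
            self.proposed_counter_name = None
            self.counter_candidates_seen.clear()
```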

Comment on lines +1 to +6
# Reproduction Report: NeMo-RL v0.5.0 + vLLM Speculative Decoding with Draft Models

**Date:** 2026-02-10
**Author:** Shaunak J
**Host:** viking-prod-228 (8x NVIDIA H20, computelab-sc-01 cluster)


⚠️ Potential issue | 🟡 Minor

Internal infrastructure details exposed in an open-source repo.

Lines 5 and 259–264 reference internal hostnames (viking-prod-228, computelab-sc-01), user scratch paths (/home/scratch.shaunakj_other/), and cluster details. If this repo is public, consider redacting or generalizing these to placeholder paths (e.g., <your-scratch-dir>).

🤖 Prompt for AI Agents
In `@docs/repro-spec-decode-build.md` around lines 1 - 6, The doc exposes internal
infra details—replace the specific hostnames and paths mentioned in the header
and body (e.g., "viking-prod-228", "computelab-sc-01", and
"/home/scratch.shaunakj_other/") with generalized placeholders like
"<host-name>" or "<your-scratch-dir>" across the "Reproduction Report: NeMo-RL
v0.5.0 + vLLM Speculative Decoding with Draft Models" document; search for those
exact strings and update them to non-sensitive placeholders and, if needed, add
a short note that readers should substitute their own environment values.

Comment on lines +1 to +46
defaults: ../../grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen2.5-7b-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-4B
  tokenizer:
    name: Qwen/Qwen3-4B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.85
      max_model_len: 2048
      async_engine: true
      enable_vllm_metrics_logger: true
      vllm_metrics_logger_interval: 0.5
      vllm_kwargs:
        speculative_config:
          method: draft_model
          model: Qwen/Qwen3-0.6B
          num_speculative_tokens: 5
          draft_tensor_parallel_size: 1
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen2.5-7b-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-qwen2.5-7b-spec-decode-1n8g
cluster:
  gpus_per_node: 8

⚠️ Potential issue | 🟠 Major

Filename/model mismatch: file says qwen2.5-7b but config uses Qwen3-4B.

The filename grpo-qwen2.5-7b-spec-decode-1n8g.yaml implies this is a Qwen 2.5 7B configuration, but the actual model_name and tokenizer (Lines 9, 11) reference Qwen/Qwen3-4B. The checkpoint dir, log dir, and wandb name also still reference qwen2.5-7b. Either rename the file and paths to reflect the actual model, or update the model/tokenizer config to match the filename.

🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml` around
lines 1 - 46, The config filename and several path/name fields reference
qwen2.5-7b but the policy.model_name and policy.tokenizer.name are set to
Qwen/Qwen3-4B; make them consistent by either (A) updating policy.model_name and
policy.tokenizer.name to the intended Qwen2.5-7b model identifier (and adjust
any vllm_kwargs/speculative model names accordingly), or (B) rename the filename
and all related identifiers (checkpoint_dir, logger.log_dir, logger.wandb.name,
and any other qwen2.5-7b strings) to reflect Qwen3-4B; ensure you also update
the speculative_config.model if using speculative decoding so it matches the
chosen base model.

Comment on lines +1 to +34
defaults: /opt/nemo-rl/examples/configs/grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen3-14b-no-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-14B
  tokenizer:
    name: Qwen/Qwen3-14B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.80
      max_model_len: 2048
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen3-14b-no-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
cluster:
  gpus_per_node: 8

⚠️ Potential issue | 🟡 Minor

Inconsistencies with the lowbatch variant of this config.

Compared to grpo-qwen3-14b-no-spec-decode-lowbatch-1n8g.yaml, this file is missing several settings that the lowbatch variant includes:

  • checkpointing.enabled: false
  • policy.generation.temperature
  • policy.generation.vllm_cfg.async_engine
  • policy.generation.vllm_cfg.enable_vllm_metrics_logger / vllm_metrics_logger_interval

If these omissions are intentional (relying on defaults), that's fine — but checkpointing.enabled being absent here while explicitly set in the lowbatch variant is worth double-checking to ensure the intended behavior.

🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen3-14b-no-spec-decode-1n8g.yaml` around
lines 1 - 34, This config is missing several keys present in the lowbatch
variant; update the YAML to explicitly add checkpointing.enabled (e.g., false if
you want the same behavior as lowbatch) and the generation keys under policy:
policy.generation.temperature, policy.generation.vllm_cfg.async_engine, and
policy.generation.vllm_cfg.enable_vllm_metrics_logger (and
vllm_metrics_logger_interval) to match the lowbatch settings or to the intended
values; locate these settings by the symbols checkpointing.enabled,
policy.generation.temperature, policy.generation.vllm_cfg.async_engine,
policy.generation.vllm_cfg.enable_vllm_metrics_logger, and
vllm_metrics_logger_interval and add them with the desired values so behavior is
explicit rather than relying on implicit defaults.

@@ -0,0 +1,36 @@
defaults: /opt/nemo-rl/examples/configs/grpo_math_1B.yaml

⚠️ Potential issue | 🟡 Minor

Rename recipe file to match the LLM recipe naming convention.

The node/GPU segment is currently at the end of the filename; the repo pattern expects it immediately after <model>. Consider renaming to something like grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml to keep tooling consistent.

As per coding guidelines: For LLM recipes, follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh).

🤖 Prompt for AI Agents
In
`@examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml`
at line 1, The recipe filename violates the LLM naming convention by placing the
node/GPU segment at the end; rename the file from
grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml to follow the pattern used by
other LLM recipes (e.g., grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml) so
the node/GPU segment appears immediately after the model, ensuring consistency
with the naming rule
"<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)"
and update any references to this filename in tooling or docs (search for
occurrences of grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml and replace).

    max_new_tokens: 4096
    temperature: 0.6
    vllm_cfg:
      tensor_parallel_size: 1

⚠️ Potential issue | 🟡 Minor

tensor_parallel_size: 1 may be tight for a 32B model.

A 32B parameter model in BF16 requires ~64 GB just for weights. With gpu_memory_utilization: 0.85, the available budget on an 80 GB GPU (~68 GB) leaves very little room for KV cache and the draft model. This will work on 96 GB GPUs (H20) but may OOM on H100 80 GB. Consider documenting the minimum GPU memory requirement or bumping TP if this config is intended to be general-purpose.
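A quick back-of-the-envelope check of this headroom concern (the draft-model size is an assumption; real vLLM allocation also covers activations, CUDA graphs, and fragmentation):

```python
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in decimal GB for BF16 (2 bytes per parameter)."""
    return params_billion * bytes_per_param


main_weights = weight_gb(32)    # ~64 GB for the 32B target model
draft_weights = weight_gb(0.6)  # ~1.2 GB, assuming a 0.6B draft model
budget_80 = 80 * 0.85           # 68 GB usable at gpu_memory_utilization=0.85
budget_96 = 96 * 0.85           # ~81.6 GB usable on a 96 GB H20

print(f"KV-cache headroom on 80 GB: {budget_80 - main_weights - draft_weights:.1f} GB")
print(f"KV-cache headroom on 96 GB: {budget_96 - main_weights - draft_weights:.1f} GB")
```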

🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml`
at line 26, The config sets tensor_parallel_size: 1 for a 32B model which is
likely to OOM on 80GB GPUs when gpu_memory_utilization is 0.85; update the YAML
to either document a minimum GPU memory requirement (mentioning 96GB/H20 or
similar) or increase tensor_parallel_size (e.g., to 2 or more) so model weights
fit alongside KV cache and draft model—adjust the comments and any README
references for this example and ensure the fields tensor_parallel_size and
gpu_memory_utilization in the grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml
reflect the new recommended values or include the minimum-GPU note.

Comment on lines +386 to +415
def _counter_delta_over_step(counter_series: list[Any]) -> float:
    """Estimate step-local increments from a cumulative counter timeline."""
    parsed_values = []
    for value in counter_series:
        try:
            parsed_values.append(float(value))
        except (TypeError, ValueError):
            continue

    if len(parsed_values) < 2:
        return 0.0

    delta = parsed_values[-1] - parsed_values[0]
    # If the counter resets during the step, fallback to the latest counter value.
    if delta < 0:
        return max(0.0, parsed_values[-1])
    return delta


def compute_spec_decode_token_acceptance_metrics(
    generation_logger_metrics: dict[str, Any],
) -> dict[str, float]:
    """Compute speculative decoding token acceptance metrics from logger timelines.

    Returns:
        A dictionary with accepted/proposed draft token deltas and, when available,
        acceptance rate. If proposed tokens are present but the accepted-token
        counter is unavailable, acceptance rate is intentionally omitted so callers
        can report `N/A` instead of a misleading `0.0`.
    """

⚠️ Potential issue | 🟡 Minor

Add Google‑style docstrings for the new helpers.

Both _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics are missing the Args section (and _counter_delta_over_step is also missing a Returns section) required by the repo docstring standard.

✍️ Docstring update
 def _counter_delta_over_step(counter_series: list[Any]) -> float:
-    """Estimate step-local increments from a cumulative counter timeline."""
+    """Estimate step-local increments from a cumulative counter timeline.
+
+    Args:
+        counter_series: Timeline of cumulative counter values for a step.
+
+    Returns:
+        Estimated step-local increment.
+    """
 def compute_spec_decode_token_acceptance_metrics(
     generation_logger_metrics: dict[str, Any],
 ) -> dict[str, float]:
-    """Compute speculative decoding token acceptance metrics from logger timelines.
-
-    Returns:
-        A dictionary with accepted/proposed draft token deltas and, when available,
-        acceptance rate. If proposed tokens are present but the accepted-token
-        counter is unavailable, acceptance rate is intentionally omitted so callers
-        can report `N/A` instead of a misleading `0.0`.
-    """
+    """Compute speculative decoding token acceptance metrics from logger timelines.
+
+    Args:
+        generation_logger_metrics: Per-worker vLLM logger timelines.
+
+    Returns:
+        A dictionary with accepted/proposed draft token deltas and, when available,
+        acceptance rate. If proposed tokens are present but the accepted-token
+        counter is unavailable, acceptance rate is intentionally omitted so callers
+        can report `N/A` instead of a misleading `0.0`.
+    """
As per coding guidelines: Use Google style docstrings for classes and functions, which can be parsed by Sphinx.

Also applies to: 405-415

🤖 Prompt for AI Agents
In `@nemo_rl/algorithms/utils.py` around lines 386 - 415, Both helper functions
lack Google-style docstring sections: add an "Args" section to
_counter_delta_over_step(counter_series) describing counter_series: list[Any]
and its expected content/type, and add both "Args" and "Returns" to
compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
describing generation_logger_metrics: dict[str, Any] and the returned dict[str,
float]; ensure the docstrings follow the repo's Google style (param
descriptions, return description, types) and keep existing summary lines intact
for _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics.

     # sudo apt-get install libibverbs-dev
     "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480",
-    "vllm==0.11.2",
+    "vllm>=0.15.1,<0.16",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all vLLM imports in the codebase
rg -n --type=py 'from vllm|import vllm' -g '!**/test/**' -g '!**/.venv/**' -g '!**/3rdparty/**' | head -100

Repository: NVIDIA-NeMo/RL

Length of output: 4382


🏁 Script executed:

#!/bin/bash
# Check git history to see what the original vllm version was
git log -p --follow -S 'vllm' -- pyproject.toml | grep -A 5 -B 5 'vllm' | head -80

Repository: NVIDIA-NeMo/RL

Length of output: 40


🏁 Script executed:

#!/bin/bash
# Check the build script to see the git ref and how it's used
cat tools/build-custom-vllm.sh | head -150

Repository: NVIDIA-NeMo/RL

Length of output: 6116


🌐 Web query:

vLLM v1 API introduction version history 0.14 0.15

💡 Result:

What “vLLM V1” refers to (engine + serving stack)

  • V1 (the re-architected core engine) was announced as an alpha on Jan 27, 2025, enabled via VLLM_USE_V1=1, while keeping the existing Python/OpenAI-compatible surface APIs unchanged. [1]
  • V1 became the default engine in vLLM v0.8.0 (per the vLLM maintainers’ deprecation RFC for V0). [2]

Version history around v0.14 → v0.15 (Jan 2026)

Release dates (PyPI):

  • v0.14.0 — Jan 20, 2026; v0.14.1 — Jan 24, 2026; v0.15.0 — Jan 29, 2026; v0.15.1 — Feb 5, 2026. [3]

vLLM v0.14.0 (Jan 20, 2026) — notable “API/serving” items

  • Added/expanded API & Frontend features including gRPC server, offline FastAPI docs, LLM(..., attention_config=...), /server_info, and MCP streaming in the Responses API (plus various CLI/API improvements). [4]

vLLM v0.14.1 (Jan 24, 2026)

  • Patch release: security + memory leak fixes on top of v0.14.0. [5]

vLLM v0.15.0 (Jan 29, 2026) — notable “v1/* OpenAI-style API” items

  • Responses API improvements (e.g., partial message generation; prompt_cache_key; tuning for include_stop_str_in_output). [6]
  • OpenAI API tweak: skip_special_tokens configuration. [6]
  • Added/updated other endpoints such as score endpoint input formats and render endpoints for prompt preprocessing. [6]

vLLM v0.15.1 (Feb 5, 2026)

  • Patch release: security fixes + GPU (RTX Blackwell) fixes + bug fixes. [7]

Sources: vLLM blog post on V1 alpha [1]; vLLM GitHub RFC on V0 deprecation/V1 default timeline [2]; PyPI release history [3]; GitHub release notes for v0.14.0 [4], v0.14.1 [5], v0.15.0 [6], v0.15.1 [7].


🏁 Script executed:

#!/bin/bash
# Check the current pyproject.toml to confirm the vllm version constraint
cat pyproject.toml | grep -A 5 -B 5 'vllm'

Repository: NVIDIA-NeMo/RL

Length of output: 2511


Verify vLLM v1 API availability across 0.14.x and 0.15.x versions.

The dependency moves from vllm==0.11.2 to vllm>=0.15.1,<0.16. The codebase imports from vllm.v1.* (e.g., vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*), which exist in both v0.14.x and v0.15.x since the v1 API was introduced in v0.8.0.

However, the custom build script (tools/build-custom-vllm.sh) produces a v0.14.0rc2 dev build while PyPI users will install v0.15.1+. Although the build script unpins vLLM in pyproject.toml during custom installs, ensure all used vLLM APIs remain compatible across both versions to support both installation paths.
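A minimal guard for these dual installation paths could be a feature-probe shim like the sketch below; the helper itself is hypothetical, and only vllm.v1.engine.async_llm.AsyncLLM is taken from the imports listed above:

```python
try:
    # Present in the v1 engine API on both the 0.14.x custom wheel and 0.15.x PyPI builds.
    from vllm.v1.engine.async_llm import AsyncLLM
    HAS_V1_ASYNC_LLM = True
except ImportError:
    AsyncLLM = None  # type: ignore[assignment]
    HAS_V1_ASYNC_LLM = False


def require_v1_async_llm() -> None:
    """Fail fast with an actionable message instead of a deep AttributeError later."""
    if not HAS_V1_ASYNC_LLM:
        raise RuntimeError(
            "This vLLM build does not expose vllm.v1.engine.async_llm.AsyncLLM; "
            "install vllm>=0.15.1,<0.16 from PyPI or the custom spec-decode dev wheel."
        )
```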

🤖 Prompt for AI Agents
In `@pyproject.toml` at line 79, Dependency change to "vllm>=0.15.1,<0.16" may
mismatch the custom build v0.14.0rc2; audit all imports from the v1 API (e.g.,
vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*) to ensure the
specific functions/classes we use exist and behave identically in both v0.14.x
and v0.15.x, run unit/integration tests against both versions, and if any
incompatibility is found either pin pyproject.toml to a compatible range or add
conditional imports/shims (or adjust tools/build-custom-vllm.sh to produce a
matching v0.15.x dev build) so both PyPI installs and custom builds work
consistently.

Comment on lines +599 to +626
def test_compute_spec_decode_token_acceptance_metrics_normal_case():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
        "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 26.0
    assert metrics["proposed_draft_tokens"] == 46.0
    assert metrics["token_acceptance_rate"] == 26.0 / 46.0
    assert metrics["spec_decode_accepted_counter_found"] == 1.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0


def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {},
        "spec_decode_proposed_tokens": {0: [120, 170]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 0.0
    assert metrics["proposed_draft_tokens"] == 50.0
    assert "token_acceptance_rate" not in metrics
    assert metrics["spec_decode_accepted_counter_found"] == 0.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0

⚠️ Potential issue | 🟡 Minor

Add Google‑style docstrings to the new tests.

The new test functions are missing docstrings while the rest of the module uses them.

✍️ Suggested docstrings
 def test_compute_spec_decode_token_acceptance_metrics_normal_case():
+    """Verify acceptance metrics for normal counter timelines."""
     generation_logger_metrics = {
         "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
         "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
     }
@@
 def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
+    """Verify behavior when the accepted counter is unavailable."""
     generation_logger_metrics = {
         "spec_decode_accepted_tokens": {},
         "spec_decode_proposed_tokens": {0: [120, 170]},
     }
As per coding guidelines: Use Google style docstrings for classes and functions, which can be parsed by Sphinx.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_compute_spec_decode_token_acceptance_metrics_normal_case():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
        "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
    }
    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
    assert metrics["accepted_draft_tokens"] == 26.0
    assert metrics["proposed_draft_tokens"] == 46.0
    assert metrics["token_acceptance_rate"] == 26.0 / 46.0
    assert metrics["spec_decode_accepted_counter_found"] == 1.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0

def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {},
        "spec_decode_proposed_tokens": {0: [120, 170]},
    }
    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
    assert metrics["accepted_draft_tokens"] == 0.0
    assert metrics["proposed_draft_tokens"] == 50.0
    assert "token_acceptance_rate" not in metrics
    assert metrics["spec_decode_accepted_counter_found"] == 0.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0

def test_compute_spec_decode_token_acceptance_metrics_normal_case():
    """Verify acceptance metrics for normal counter timelines."""
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
        "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
    }
    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
    assert metrics["accepted_draft_tokens"] == 26.0
    assert metrics["proposed_draft_tokens"] == 46.0
    assert metrics["token_acceptance_rate"] == 26.0 / 46.0
    assert metrics["spec_decode_accepted_counter_found"] == 1.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0

def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
    """Verify behavior when the accepted counter is unavailable."""
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {},
        "spec_decode_proposed_tokens": {0: [120, 170]},
    }
    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
    assert metrics["accepted_draft_tokens"] == 0.0
    assert metrics["proposed_draft_tokens"] == 50.0
    assert "token_acceptance_rate" not in metrics
    assert metrics["spec_decode_accepted_counter_found"] == 0.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0
🤖 Prompt for AI Agents
In `@tests/unit/algorithms/test_utils.py` around lines 599 - 626, Add Google-style
docstrings to the two test functions
test_compute_spec_decode_token_acceptance_metrics_normal_case and
test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter: each
docstring should be a short one-line summary of what the test verifies, followed
by an "Args:" section (None) and a "Returns:" section (None) to match existing
module style so Sphinx can parse them; place the docstring immediately under the
def line for each test and keep wording concise and consistent with other tests
in the file.

Comment on lines +24 to +28
 GIT_REF=${2:-4a5299c93ff97c26def537b92562df5ada530fea}
 # NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
 # This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"

⚠️ Potential issue | 🟡 Minor

Stale comment on line 26 and version mismatch with pyproject.toml.

Two issues:

  1. Stale comment (line 26): The note "This commit was chosen as one close to the v0.10 release" no longer applies — the commit is now from vLLM PR #24322 (spec decode with draft models), producing a 0.14.0rc2 dev build. Update or remove the stale comment to avoid confusion.

  2. Version mismatch: The wheel filename embeds version 0.14.0rc2.dev156, but pyproject.toml requires vllm>=0.15.1,<0.16. While the build script unpins vllm in pyproject.toml before installing, this means the custom-build path produces a vLLM version that would not satisfy the standard dependency constraint. This is worth documenting explicitly (e.g., a comment near VLLM_WHEEL_COMMIT) so users understand the custom build is intentionally outside the pinned range.

Suggested comment update
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea}  # merge commit of vllm PR `#24322` (spec decode with draft models)
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# This commit produces a 0.14.0rc2 dev wheel, which is intentionally outside the >=0.15.1 range in pyproject.toml.
+# The build script unpins the vllm constraint to allow this local editable install.
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea}  # merge commit of vllm PR `#24322` (spec decode with draft models)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
GIT_REF=${2:-4a5299c93ff97c26def537b92562df5ada530fea}
# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
GIT_REF=${2:-4a5299c93ff97c26def537b92562df5ada530fea}
# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
# This commit produces a 0.14.0rc2 dev wheel, which is intentionally outside the >=0.15.1 range in pyproject.toml.
# The build script unpins the vllm constraint to allow this local editable install.
VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR `#24322` (spec decode with draft models)
export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
🤖 Prompt for AI Agents
In `@tools/build-custom-vllm.sh` around lines 24 - 28, Update the stale note and
add an explicit warning about the version mismatch: replace or remove the old
"close to the v0.10 release" comment near GIT_REF/VLLM_WHEEL_COMMIT and instead
add a short comment next to VLLM_WHEEL_COMMIT (and/or
VLLM_PRECOMPILED_WHEEL_LOCATION) stating this commit/build produces a vllm wheel
tagged 0.14.0rc2.dev156 (from PR `#24322`) which is intentionally outside the
normal pyproject.toml constraint (pyproject.toml requires vllm>=0.15.1,<0.16)
and that the script unpins vllm before install when using the custom wheel;
reference the variables GIT_REF, VLLM_WHEEL_COMMIT and
VLLM_PRECOMPILED_WHEEL_LOCATION to locate where to update the comments.

@chtruong814 chtruong814 added the needs-follow-up (Issue needs follow-up) label on Feb 15, 2026

Labels

community-request, documentation (Improvements or additions to documentation), needs-follow-up (Issue needs follow-up)
