📝 Walkthrough

This PR adds speculative decoding metrics collection and logging to the GRPO training pipeline, updates vLLM-related build dependencies and configurations, adds nine GRPO configuration files for various Qwen model setups with speculative decoding, and includes reproduction documentation and build tooling updates.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 4

❌ Failed checks (3 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Tip: Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@docs/repro-spec-decode-build.md`:
- Around line 1-6: The doc exposes internal infra details—replace the specific
hostnames and paths mentioned in the header and body (e.g., "viking-prod-228",
"computelab-sc-01", and "/home/scratch.shaunakj_other/") with generalized
placeholders like "<host-name>" or "<your-scratch-dir>" across the "Reproduction
Report: NeMo-RL v0.5.0 + vLLM Speculative Decoding with Draft Models" document;
search for those exact strings and update them to non-sensitive placeholders
and, if needed, add a short note that readers should substitute their own
environment values.
In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml`:
- Around line 1-46: The config filename and several path/name fields reference
qwen2.5-7b but the policy.model_name and policy.tokenizer.name are set to
Qwen/Qwen3-4B; make them consistent by either (A) updating policy.model_name and
policy.tokenizer.name to the intended Qwen2.5-7b model identifier (and adjust
any vllm_kwargs/speculative model names accordingly), or (B) rename the filename
and all related identifiers (checkpoint_dir, logger.log_dir, logger.wandb.name,
and any other qwen2.5-7b strings) to reflect Qwen3-4B; ensure you also update
the speculative_config.model if using speculative decoding so it matches the
chosen base model.
In `@examples/configs/recipes/llm/grpo-qwen3-14b-no-spec-decode-1n8g.yaml`:
- Around line 1-34: This config is missing several keys present in the lowbatch
variant; update the YAML to explicitly add checkpointing.enabled (e.g., false if
you want the same behavior as lowbatch) and the generation keys under policy:
policy.generation.temperature, policy.generation.vllm_cfg.async_engine, and
policy.generation.vllm_cfg.enable_vllm_metrics_logger (and
vllm_metrics_logger_interval) to match the lowbatch settings or to the intended
values; locate these settings by the symbols checkpointing.enabled,
policy.generation.temperature, policy.generation.vllm_cfg.async_engine,
policy.generation.vllm_cfg.enable_vllm_metrics_logger, and
vllm_metrics_logger_interval and add them with the desired values so behavior is
explicit rather than relying on implicit defaults.
In
`@examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml`:
- Line 1: The recipe filename violates the LLM naming convention by placing the
node/GPU segment at the end; rename the file from
grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml to follow the pattern used by
other LLM recipes (e.g., grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml) so
the node/GPU segment appears immediately after the model, ensuring consistency
with the naming rule
"<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)"
and update any references to this filename in tooling or docs (search for
occurrences of grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml and replace).
In `@examples/configs/recipes/llm/grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml`:
- Line 26: The config sets tensor_parallel_size: 1 for a 32B model which is
likely to OOM on 80GB GPUs when gpu_memory_utilization is 0.85; update the YAML
to either document a minimum GPU memory requirement (mentioning 96GB/H20 or
similar) or increase tensor_parallel_size (e.g., to 2 or more) so model weights
fit alongside KV cache and draft model—adjust the comments and any README
references for this example and ensure the fields tensor_parallel_size and
gpu_memory_utilization in the grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml
reflect the new recommended values or include the minimum-GPU note.
In `@nemo_rl/algorithms/utils.py`:
- Around line 386-415: Both helper functions lack Google-style docstring
sections: add an "Args" section to _counter_delta_over_step(counter_series)
describing counter_series: list[Any] and its expected content/type, and add both
"Args" and "Returns" to
compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
describing generation_logger_metrics: dict[str, Any] and the returned dict[str,
float]; ensure the docstrings follow the repo's Google style (param
descriptions, return description, types) and keep existing summary lines intact
for _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics.
In `@pyproject.toml`:
- Line 79: Dependency change to "vllm>=0.15.1,<0.16" may mismatch the custom
build v0.14.0rc2; audit all imports from the v1 API (e.g.,
vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*) to ensure the
specific functions/classes we use exist and behave identically in both v0.14.x
and v0.15.x, run unit/integration tests against both versions, and if any
incompatibility is found either pin pyproject.toml to a compatible range or add
conditional imports/shims (or adjust tools/build-custom-vllm.sh to produce a
matching v0.15.x dev build) so both PyPI installs and custom builds work
consistently.
In `@tests/unit/algorithms/test_utils.py`:
- Around line 599-626: Add Google-style docstrings to the two test functions
test_compute_spec_decode_token_acceptance_metrics_normal_case and
test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter: each
docstring should be a short one-line summary of what the test verifies, followed
by an "Args:" section (None) and a "Returns:" section (None) to match existing
module style so Sphinx can parse them; place the docstring immediately under the
def line for each test and keep wording concise and consistent with other tests
in the file.
In `@tools/build-custom-vllm.sh`:
- Around line 24-28: Update the stale note and add an explicit warning about the
version mismatch: replace or remove the old "close to the v0.10 release" comment
near GIT_REF/VLLM_WHEEL_COMMIT and instead add a short comment next to
VLLM_WHEEL_COMMIT (and/or VLLM_PRECOMPILED_WHEEL_LOCATION) stating this
commit/build produces a vllm wheel tagged 0.14.0rc2.dev156 (from PR `#24322`)
which is intentionally outside the normal pyproject.toml constraint
(pyproject.toml requires vllm>=0.15.1,<0.16) and that the script unpins vllm
before install when using the custom wheel; reference the variables GIT_REF,
VLLM_WHEEL_COMMIT and VLLM_PRECOMPILED_WHEEL_LOCATION to locate where to update
the comments.
🧹 Nitpick comments (10)
examples/configs/recipes/llm/grpo-qwen3-14b-draft0.6b-spec-decode-tuned-1n8g.yaml (1)
1-44: Recipe filename doesn't follow the prescribed naming pattern.

The coding guideline requires LLM recipe names to follow `<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers]`. In the current filename, `1n8g` appears at the end rather than immediately after the model identifier. Consider renaming to something like `grpo-qwen3-14b-1n8g-specdec-draft0.6b-tuned.yaml`. This places the node/GPU segment (`1n8g`) right after the model, with the strategy and modifiers following it.

As per coding guidelines: "For LLM recipes, follow the naming pattern: `<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)`"

examples/configs/recipes/llm/grpo-qwen3-14b-spec-decode-1n8g.yaml (1)
31-35: `draft_tensor_parallel_size: 2` may be unnecessary for a 1.7B draft model.

A 1.7B model typically fits comfortably on a single GPU. Using `draft_tensor_parallel_size: 1` would free GPU resources for the main model and reduce communication overhead. Was TP=2 intentional here?

Also note: the AI summary incorrectly states the draft model is `Qwen/Qwen3-0.6B`, but the config specifies `Qwen/Qwen3-1.7B`.

docker/Dockerfile (1)
117-118: Consider adding a tracking issue or TODO for re-enabling the sglang extra.

The commented-out sync is a reasonable workaround for the broken upstream `sgl-kernel` build. To avoid this becoming a permanent dead comment, consider linking a tracking issue so it gets re-enabled once the upstream DeepGEMM reference is restored.

docs/repro-spec-decode-build.md (1)
26-28: Add language identifiers to fenced code blocks.

A few code blocks (Lines 26, 42, 245) are missing language specifiers, which was flagged by markdownlint. Add appropriate identifiers (e.g., `text` or `log`).

examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml (1)
16-23: Align `max_new_tokens` with the full context window.

`max_total_sequence_length` is 5120 while `max_new_tokens` is 4096, which diverges from the GRPO config convention. Consider matching the full context window for consistent generation limits.

Based on learnings: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase.

♻️ Suggested change

```diff
-      max_new_tokens: 4096
+      max_new_tokens: 5120
```

nemo_rl/models/generation/vllm/vllm_worker_async.py (5)
150-205: Robust `CompilationConfig` field introspection – looks good overall.

The approach of first checking `__pydantic_fields__` and falling back to `inspect.signature` is a reasonable forward-compatibility measure. One small note: the broad `except (TypeError, ValueError): pass` at lines 185-186 silently swallows introspection failures with no log message. Consider emitting a warning when this fallback path fails, so operators can diagnose issues when upgrading vLLM versions.

Suggested improvement

```diff
         except (TypeError, ValueError):
-            pass
+            print(
+                "⚠️[vLLM Init] Could not introspect CompilationConfig "
+                "fields; passing compilation_config dict as-is.",
+            )
```
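For readers unfamiliar with the pattern being discussed, here is a minimal, self-contained sketch of introspection-with-a-warning. It is not the PR's code: `filter_supported_fields` and `raw_cfg` are hypothetical names, and the only assumptions are pydantic v2's `__pydantic_fields__` attribute and the standard-library `inspect` module.

```python
import inspect
from typing import Any


def filter_supported_fields(config_cls: type, raw_cfg: dict[str, Any]) -> dict[str, Any]:
    """Keep only keys that config_cls accepts, warning instead of failing silently."""
    pydantic_fields = getattr(config_cls, "__pydantic_fields__", None)
    if isinstance(pydantic_fields, dict):
        field_names = set(pydantic_fields)
    else:
        try:
            # Fall back to the constructor signature for non-pydantic config classes.
            field_names = set(inspect.signature(config_cls.__init__).parameters) - {"self"}
        except (TypeError, ValueError):
            # Surface the failure instead of swallowing it, as the review suggests.
            print(f"⚠️ Could not introspect {config_cls.__name__}; passing config dict as-is.")
            return dict(raw_cfg)
    return {key: value for key, value in raw_cfg.items() if key in field_names}
```

Dropping unknown keys (rather than erroring) keeps older configs working across vLLM upgrades, while the warning preserves debuggability when introspection breaks.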
260-337: `_is_fallback_accepted_token_counter` is overly broad — could match unrelated counters.

The fallback (lines 331–337) matches any counter whose normalized name contains `"accept"` and `"token"` (excluding `"vllm:generation_tokens"`), without requiring the counter to be spec-decode-related. Although the call site (line 382) gates it behind `_spec_decode_proposed_counter_name is not None`, a non-spec-decode counter (e.g., a hypothetical future `vllm:kv_cache_accepted_token_evictions`) could be incorrectly selected as the accepted-token counter.

Consider adding `_is_spec_decode_related_counter` as an additional guard in `_is_fallback_accepted_token_counter` (or at the call site):

Suggested tightening

```diff
 def _is_fallback_accepted_token_counter(name: str) -> bool:
     normalized_name = _normalize_metric_name(name)
     return (
         "accept" in normalized_name
         and "token" in normalized_name
         and normalized_name != "vllm:generation_tokens"
+        and _is_spec_decode_related_counter(normalized_name)
     )
```
365-391: Counter discovery is order-dependent and latches on the first match.

The accepted/proposed counter names are set by the first matching counter encountered in `get_metrics_snapshot()` (first-write-wins via the `is None` guard). If vLLM exposes multiple counters that match the heuristic (e.g., both a total and a per-step variant), the selected counter depends on iteration order — which may vary across vLLM versions.

The preferred-name sets (lines 282–301) partially address this, but `_is_accepted_token_counter` / `_is_proposed_token_counter` also accept fuzzy matches that are checked after the preferred names, using the same first-write-wins pattern.

This is acceptable for now given the logging at lines 393-404 that announces the selection, but be aware this could silently pick a wrong counter after a vLLM upgrade. A future improvement could iterate all counters first, then select from the candidates using explicit priority.
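As a rough illustration of the "collect all candidates first, then select by explicit priority" idea, consider the sketch below. The names (`select_accepted_counter`, `_PREFERRED_ACCEPTED_NAMES`) are hypothetical; the real preferred-name sets and heuristics live in `vllm_worker_async.py` and may differ.

```python
from typing import Iterable, Optional

# Hypothetical priority list; the actual preferred names are defined in the worker.
_PREFERRED_ACCEPTED_NAMES = (
    "vllm:spec_decode_num_accepted_tokens",
    "vllm:spec_decode_num_accepted_tokens_total",
)


def select_accepted_counter(candidate_names: Iterable[str]) -> Optional[str]:
    """Pick the accepted-token counter after seeing all candidates, not the first match."""
    names = sorted(set(candidate_names))  # deterministic regardless of snapshot order
    # 1) Exact preferred names win, in explicit priority order.
    for preferred in _PREFERRED_ACCEPTED_NAMES:
        if preferred in names:
            return preferred
    # 2) Only then fall back to a fuzzy, spec-decode-scoped heuristic.
    fuzzy_matches = [
        name
        for name in names
        if "spec_decode" in name and "accept" in name and "token" in name
    ]
    return fuzzy_matches[0] if fuzzy_matches else None
```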
388-391: Both `if` checks run unconditionally — if accepted and proposed counters ever share the same name, the value would be double-counted.

Lines 388–389 and 390–391 are independent `if` statements (not `elif`). While current logic ensures the names differ (proposed excludes `"accept"` and accepted excludes proposed-related keywords), this is a latent hazard if the heuristics are ever relaxed.
460-470: `clear_vllm_logger_metrics` resets token lists but not counter names or candidates-seen.

This appears intentional (counter names are discovered once and persist), but worth confirming: after a clear, the next `get_vllm_logger_metrics` call will still return the old `spec_decode_accepted_counter_name`, `spec_decode_proposed_counter_name`, and `spec_decode_counter_candidates_seen`. If the engine is reconfigured (e.g., spec-decode disabled between steps), stale counter names would persist. If this is the intended design, a brief comment would help future maintainers.
```markdown
# Reproduction Report: NeMo-RL v0.5.0 + vLLM Speculative Decoding with Draft Models

**Date:** 2026-02-10
**Author:** Shaunak J
**Host:** viking-prod-228 (8x NVIDIA H20, computelab-sc-01 cluster)
```
Internal infrastructure details exposed in an open-source repo.
Lines 5 and 259–264 reference internal hostnames (viking-prod-228, computelab-sc-01), user scratch paths (/home/scratch.shaunakj_other/), and cluster details. If this repo is public, consider redacting or generalizing these to placeholder paths (e.g., <your-scratch-dir>).
🤖 Prompt for AI Agents
In `@docs/repro-spec-decode-build.md` around lines 1 - 6, The doc exposes internal
infra details—replace the specific hostnames and paths mentioned in the header
and body (e.g., "viking-prod-228", "computelab-sc-01", and
"/home/scratch.shaunakj_other/") with generalized placeholders like
"<host-name>" or "<your-scratch-dir>" across the "Reproduction Report: NeMo-RL
v0.5.0 + vLLM Speculative Decoding with Draft Models" document; search for those
exact strings and update them to non-sensitive placeholders and, if needed, add
a short note that readers should substitute their own environment values.
```yaml
defaults: ../../grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen2.5-7b-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-4B
  tokenizer:
    name: Qwen/Qwen3-4B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.85
      max_model_len: 2048
      async_engine: true
      enable_vllm_metrics_logger: true
      vllm_metrics_logger_interval: 0.5
      vllm_kwargs:
        speculative_config:
          method: draft_model
          model: Qwen/Qwen3-0.6B
          num_speculative_tokens: 5
          draft_tensor_parallel_size: 1
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen2.5-7b-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
  wandb:
    project: nemo-rl
    name: grpo-qwen2.5-7b-spec-decode-1n8g
cluster:
  gpus_per_node: 8
```
Filename/model mismatch: file says qwen2.5-7b but config uses Qwen3-4B.
The filename grpo-qwen2.5-7b-spec-decode-1n8g.yaml implies this is a Qwen 2.5 7B configuration, but the actual model_name and tokenizer (Lines 9, 11) reference Qwen/Qwen3-4B. The checkpoint dir, log dir, and wandb name also still reference qwen2.5-7b. Either rename the file and paths to reflect the actual model, or update the model/tokenizer config to match the filename.
🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen2.5-7b-spec-decode-1n8g.yaml` around
lines 1 - 46, The config filename and several path/name fields reference
qwen2.5-7b but the policy.model_name and policy.tokenizer.name are set to
Qwen/Qwen3-4B; make them consistent by either (A) updating policy.model_name and
policy.tokenizer.name to the intended Qwen2.5-7b model identifier (and adjust
any vllm_kwargs/speculative model names accordingly), or (B) rename the filename
and all related identifiers (checkpoint_dir, logger.log_dir, logger.wandb.name,
and any other qwen2.5-7b strings) to reflect Qwen3-4B; ensure you also update
the speculative_config.model if using speculative decoding so it matches the
chosen base model.
```yaml
defaults: /opt/nemo-rl/examples/configs/grpo_math_1B.yaml
grpo:
  max_num_steps: 200
  num_prompts_per_step: 32
  num_generations_per_prompt: 16
checkpointing:
  checkpoint_dir: results/grpo-qwen3-14b-no-spec-decode-1n8g
policy:
  model_name: Qwen/Qwen3-14B
  tokenizer:
    name: Qwen/Qwen3-14B
  train_global_batch_size: 512
  train_micro_batch_size: 1
  logprob_batch_size: 1
  max_total_sequence_length: 2048
  dynamic_batching:
    enabled: true
  sequence_packing:
    enabled: false
  make_sequence_length_divisible_by: 1
  generation:
    max_new_tokens: 1024
    vllm_cfg:
      tensor_parallel_size: 2
      gpu_memory_utilization: 0.80
      max_model_len: 2048
data:
  max_input_seq_length: 1024
logger:
  log_dir: logs/grpo-qwen3-14b-no-spec-decode-1n8g
  wandb_enabled: false
  tensorboard_enabled: true
cluster:
  gpus_per_node: 8
```
Inconsistencies with the lowbatch variant of this config.
Compared to grpo-qwen3-14b-no-spec-decode-lowbatch-1n8g.yaml, this file is missing several settings that the lowbatch variant includes:
- `checkpointing.enabled: false`
- `policy.generation.temperature`
- `policy.generation.vllm_cfg.async_engine`
- `policy.generation.vllm_cfg.enable_vllm_metrics_logger` / `vllm_metrics_logger_interval`
If these omissions are intentional (relying on defaults), that's fine — but checkpointing.enabled being absent here while explicitly set in the lowbatch variant is worth double-checking to ensure the intended behavior.
🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen3-14b-no-spec-decode-1n8g.yaml` around
lines 1 - 34, This config is missing several keys present in the lowbatch
variant; update the YAML to explicitly add checkpointing.enabled (e.g., false if
you want the same behavior as lowbatch) and the generation keys under policy:
policy.generation.temperature, policy.generation.vllm_cfg.async_engine, and
policy.generation.vllm_cfg.enable_vllm_metrics_logger (and
vllm_metrics_logger_interval) to match the lowbatch settings or to the intended
values; locate these settings by the symbols checkpointing.enabled,
policy.generation.temperature, policy.generation.vllm_cfg.async_engine,
policy.generation.vllm_cfg.enable_vllm_metrics_logger, and
vllm_metrics_logger_interval and add them with the desired values so behavior is
explicit rather than relying on implicit defaults.
```diff
@@ -0,0 +1,36 @@
 defaults: /opt/nemo-rl/examples/configs/grpo_math_1B.yaml
```
Rename recipe file to match the LLM recipe naming convention.
The node/GPU segment is currently at the end of the filename; the repo pattern expects it immediately after <model>. Consider renaming to something like grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml to keep tooling consistent.
As per coding guidelines: For LLM recipes, follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh).
🤖 Prompt for AI Agents
In
`@examples/configs/recipes/llm/grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml`
at line 1, The recipe filename violates the LLM naming convention by placing the
node/GPU segment at the end; rename the file from
grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml to follow the pattern used by
other LLM recipes (e.g., grpo-qwen3-32b-1n8g-no-spec-decode-lowbatch.yaml) so
the node/GPU segment appears immediately after the model, ensuring consistency
with the naming rule
"<algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].(yaml|sh)"
and update any references to this filename in tooling or docs (search for
occurrences of grpo-qwen3-32b-no-spec-decode-lowbatch-1n8g.yaml and replace).
```yaml
      max_new_tokens: 4096
      temperature: 0.6
      vllm_cfg:
        tensor_parallel_size: 1
```
tensor_parallel_size: 1 may be tight for a 32B model.
A 32B parameter model in BF16 requires ~64 GB just for weights. With gpu_memory_utilization: 0.85, the available budget on an 80 GB GPU (~68 GB) leaves very little room for KV cache and the draft model. This will work on 96 GB GPUs (H20) but may OOM on H100 80 GB. Consider documenting the minimum GPU memory requirement or bumping TP if this config is intended to be general-purpose.
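A quick back-of-envelope check of that budget (rough numbers, not measurements; 1 GB is taken as 1e9 bytes and activation/runtime overheads are ignored):

```python
# Rough memory budget for a 32B model in BF16 under gpu_memory_utilization: 0.85.
params = 32e9                      # ~32B parameters (approximate)
weights_gb = params * 2 / 1e9      # 2 bytes per BF16 param -> ~64 GB

for gpu_gb in (80, 96):            # H100 80 GB vs. H20 96 GB
    budget_gb = gpu_gb * 0.85
    headroom_gb = budget_gb - weights_gb   # left for KV cache + draft model
    print(f"{gpu_gb} GB GPU -> budget {budget_gb:.1f} GB, headroom {headroom_gb:.1f} GB")

# 80 GB GPU -> budget 68.0 GB, headroom 4.0 GB   (tight once a draft model is loaded)
# 96 GB GPU -> budget 81.6 GB, headroom 17.6 GB  (comfortable)
```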
🤖 Prompt for AI Agents
In `@examples/configs/recipes/llm/grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml`
at line 26, The config sets tensor_parallel_size: 1 for a 32B model which is
likely to OOM on 80GB GPUs when gpu_memory_utilization is 0.85; update the YAML
to either document a minimum GPU memory requirement (mentioning 96GB/H20 or
similar) or increase tensor_parallel_size (e.g., to 2 or more) so model weights
fit alongside KV cache and draft model—adjust the comments and any README
references for this example and ensure the fields tensor_parallel_size and
gpu_memory_utilization in the grpo-qwen3-32b-spec-decode-lowbatch-1n8g.yaml
reflect the new recommended values or include the minimum-GPU note.
```python
def _counter_delta_over_step(counter_series: list[Any]) -> float:
    """Estimate step-local increments from a cumulative counter timeline."""
    parsed_values = []
    for value in counter_series:
        try:
            parsed_values.append(float(value))
        except (TypeError, ValueError):
            continue

    if len(parsed_values) < 2:
        return 0.0

    delta = parsed_values[-1] - parsed_values[0]
    # If the counter resets during the step, fallback to the latest counter value.
    if delta < 0:
        return max(0.0, parsed_values[-1])
    return delta


def compute_spec_decode_token_acceptance_metrics(
    generation_logger_metrics: dict[str, Any],
) -> dict[str, float]:
    """Compute speculative decoding token acceptance metrics from logger timelines.

    Returns:
        A dictionary with accepted/proposed draft token deltas and, when available,
        acceptance rate. If proposed tokens are present but the accepted-token
        counter is unavailable, acceptance rate is intentionally omitted so callers
        can report `N/A` instead of a misleading `0.0`.
    """
```
Add Google‑style docstrings for the new helpers.
Both _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics are missing Args (and _counter_delta_over_step missing Returns) sections required by the repo docstring standard.
✍️ Docstring update
```diff
 def _counter_delta_over_step(counter_series: list[Any]) -> float:
-    """Estimate step-local increments from a cumulative counter timeline."""
+    """Estimate step-local increments from a cumulative counter timeline.
+
+    Args:
+        counter_series: Timeline of cumulative counter values for a step.
+
+    Returns:
+        Estimated step-local increment.
+    """


 def compute_spec_decode_token_acceptance_metrics(
     generation_logger_metrics: dict[str, Any],
 ) -> dict[str, float]:
-    """Compute speculative decoding token acceptance metrics from logger timelines.
-
-    Returns:
-        A dictionary with accepted/proposed draft token deltas and, when available,
-        acceptance rate. If proposed tokens are present but the accepted-token
-        counter is unavailable, acceptance rate is intentionally omitted so callers
-        can report `N/A` instead of a misleading `0.0`.
-    """
+    """Compute speculative decoding token acceptance metrics from logger timelines.
+
+    Args:
+        generation_logger_metrics: Per-worker vLLM logger timelines.
+
+    Returns:
+        A dictionary with accepted/proposed draft token deltas and, when available,
+        acceptance rate. If proposed tokens are present but the accepted-token
+        counter is unavailable, acceptance rate is intentionally omitted so callers
+        can report `N/A` instead of a misleading `0.0`.
+    """
```

Also applies to: 405-415
🤖 Prompt for AI Agents
In `@nemo_rl/algorithms/utils.py` around lines 386 - 415, Both helper functions
lack Google-style docstring sections: add an "Args" section to
_counter_delta_over_step(counter_series) describing counter_series: list[Any]
and its expected content/type, and add both "Args" and "Returns" to
compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)
describing generation_logger_metrics: dict[str, Any] and the returned dict[str,
float]; ensure the docstrings follow the repo's Google style (param
descriptions, return description, types) and keep existing summary lines intact
for _counter_delta_over_step and compute_spec_decode_token_acceptance_metrics.
```diff
     # sudo apt-get install libibverbs-dev
     "deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@bfded34800dfec415b71503f8205181de90b2480",
-    "vllm==0.11.2",
+    "vllm>=0.15.1,<0.16",
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Find all vLLM imports in the codebase
rg -n --type=py 'from vllm|import vllm' -g '!**/test/**' -g '!**/.venv/**' -g '!**/3rdparty/**' | head -100
```

Repository: NVIDIA-NeMo/RL
Length of output: 4382
🏁 Script executed:
```bash
#!/bin/bash
# Check git history to see what the original vllm version was
git log -p --follow -S 'vllm' -- pyproject.toml | grep -A 5 -B 5 'vllm' | head -80
```

Repository: NVIDIA-NeMo/RL
Length of output: 40
🏁 Script executed:
```bash
#!/bin/bash
# Check the build script to see the git ref and how it's used
cat tools/build-custom-vllm.sh | head -150
```

Repository: NVIDIA-NeMo/RL
Length of output: 6116
🌐 Web query:
vLLM v1 API introduction version history 0.14 0.15
💡 Result:
What “vLLM V1” refers to (engine + serving stack)
- V1 (the re-architected core engine) was announced as an alpha on Jan 27, 2025, enabled via `VLLM_USE_V1=1`, while keeping the existing Python/OpenAI-compatible surface APIs unchanged. [1]
- V1 became the default engine in vLLM `v0.8.0` (per the vLLM maintainers' deprecation RFC for V0). [2]
Version history around v0.14 → v0.15 (Jan 2026)
Release dates (PyPI):
- v0.14.0 — Jan 20, 2026; v0.14.1 — Jan 24, 2026; v0.15.0 — Jan 29, 2026; v0.15.1 — Feb 5, 2026. [3]
vLLM v0.14.0 (Jan 20, 2026) — notable “API/serving” items
- Added/expanded API & Frontend features including gRPC server, offline FastAPI docs, `LLM(..., attention_config=...)`, `/server_info`, and MCP streaming in the Responses API (plus various CLI/API improvements). [4]
vLLM v0.14.1 (Jan 24, 2026)
- Patch release: security + memory leak fixes on top of `v0.14.0`. [5]
vLLM v0.15.0 (Jan 29, 2026) — notable “v1/* OpenAI-style API” items
- Responses API improvements (e.g., partial message generation; `prompt_cache_key`; tuning for `include_stop_str_in_output`). [6]
- OpenAI API tweak: `skip_special_tokens` configuration. [6]
- Added/updated other endpoints such as score endpoint input formats and render endpoints for prompt preprocessing. [6]
vLLM v0.15.1 (Feb 5, 2026)
- Patch release: security fixes + GPU (RTX Blackwell) fixes + bug fixes. [7]
Sources: vLLM blog post on V1 alpha [1]; vLLM GitHub RFC on V0 deprecation/V1 default timeline [2]; PyPI release history [3]; GitHub release notes for v0.14.0 [4], v0.14.1 [5], v0.15.0 [6], v0.15.1 [7].
🏁 Script executed:
```bash
#!/bin/bash
# Check the current pyproject.toml to confirm the vllm version constraint
cat pyproject.toml | grep -A 5 -B 5 'vllm'
```

Repository: NVIDIA-NeMo/RL
Length of output: 2511
Verify vLLM v1 API availability across 0.14.x and 0.15.x versions.
The dependency moves from vllm==0.11.2 to vllm>=0.15.1,<0.16. The codebase imports from vllm.v1.* (e.g., vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*), which exist in both v0.14.x and v0.15.x since the v1 API was introduced in v0.8.0.
However, the custom build script (tools/build-custom-vllm.sh) produces a v0.14.0rc2 dev build while PyPI users will install v0.15.1+. Although the build script unpins vLLM in pyproject.toml during custom installs, ensure all used vLLM APIs remain compatible across both versions to support both installation paths.
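If an incompatibility does turn up, one low-cost option is a runtime version guard rather than divergent code paths. Below is a minimal sketch, assuming `packaging` is available in the environment; the branch bodies are placeholders for whatever shim is needed, not actual code from this PR:

```python
from importlib.metadata import version

from packaging.version import Version

installed = Version(version("vllm"))

if installed < Version("0.15.0"):
    # Custom wheel from tools/build-custom-vllm.sh (e.g., 0.14.0rc2.dev156+g4a5299c93).
    print(f"Running against a pre-0.15 vLLM build: {installed}")
else:
    # Standard PyPI install satisfying the pyproject.toml constraint (>=0.15.1,<0.16).
    print(f"Running against a PyPI vLLM release: {installed}")
```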
🤖 Prompt for AI Agents
In `@pyproject.toml` at line 79, Dependency change to "vllm>=0.15.1,<0.16" may
mismatch the custom build v0.14.0rc2; audit all imports from the v1 API (e.g.,
vllm.v1.engine.async_llm, vllm.v1.metrics.*, vllm.v1.executor.*) to ensure the
specific functions/classes we use exist and behave identically in both v0.14.x
and v0.15.x, run unit/integration tests against both versions, and if any
incompatibility is found either pin pyproject.toml to a compatible range or add
conditional imports/shims (or adjust tools/build-custom-vllm.sh to produce a
matching v0.15.x dev build) so both PyPI installs and custom builds work
consistently.
```python
def test_compute_spec_decode_token_acceptance_metrics_normal_case():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
        "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 26.0
    assert metrics["proposed_draft_tokens"] == 46.0
    assert metrics["token_acceptance_rate"] == 26.0 / 46.0
    assert metrics["spec_decode_accepted_counter_found"] == 1.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0


def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {},
        "spec_decode_proposed_tokens": {0: [120, 170]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 0.0
    assert metrics["proposed_draft_tokens"] == 50.0
    assert "token_acceptance_rate" not in metrics
    assert metrics["spec_decode_accepted_counter_found"] == 0.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0
```
Add Google‑style docstrings to the new tests.
The new test functions are missing docstrings while the rest of the module uses them.
✍️ Suggested docstrings
```diff
 def test_compute_spec_decode_token_acceptance_metrics_normal_case():
+    """Verify acceptance metrics for normal counter timelines."""
     generation_logger_metrics = {
         "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
         "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
     }
@@
 def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
+    """Verify behavior when the accepted counter is unavailable."""
     generation_logger_metrics = {
         "spec_decode_accepted_tokens": {},
         "spec_decode_proposed_tokens": {0: [120, 170]},
     }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_compute_spec_decode_token_acceptance_metrics_normal_case():
    """Verify acceptance metrics for normal counter timelines."""
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {0: [10, 28], 1: [3, 11]},
        "spec_decode_proposed_tokens": {0: [20, 50], 1: [8, 24]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 26.0
    assert metrics["proposed_draft_tokens"] == 46.0
    assert metrics["token_acceptance_rate"] == 26.0 / 46.0
    assert metrics["spec_decode_accepted_counter_found"] == 1.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0


def test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter():
    """Verify behavior when the accepted counter is unavailable."""
    generation_logger_metrics = {
        "spec_decode_accepted_tokens": {},
        "spec_decode_proposed_tokens": {0: [120, 170]},
    }

    metrics = compute_spec_decode_token_acceptance_metrics(generation_logger_metrics)

    assert metrics["accepted_draft_tokens"] == 0.0
    assert metrics["proposed_draft_tokens"] == 50.0
    assert "token_acceptance_rate" not in metrics
    assert metrics["spec_decode_accepted_counter_found"] == 0.0
    assert metrics["spec_decode_proposed_counter_found"] == 1.0
```
🤖 Prompt for AI Agents
In `@tests/unit/algorithms/test_utils.py` around lines 599 - 626, Add Google-style
docstrings to the two test functions
test_compute_spec_decode_token_acceptance_metrics_normal_case and
test_compute_spec_decode_token_acceptance_metrics_missing_accepted_counter: each
docstring should be a short one-line summary of what the test verifies, followed
by an "Args:" section (None) and a "Returns:" section (None) to match existing
module style so Sphinx can parse them; place the docstring immediately under the
def line for each test and keep wording concise and consistent with other tests
in the file.
```diff
 GIT_REF=${2:-4a5299c93ff97c26def537b92562df5ada530fea}
 # NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
 # This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-862f2ef893d9751db0a92bd2d4ae0e3d9677872f} # use full commit hash from the main branch
-export VLLM_PRECOMPILED_WHEEL_LOCATION="https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```
Stale comment on line 26 and version mismatch with pyproject.toml.
Two issues:
- Stale comment (line 26): The note "This commit was chosen as one close to the v0.10 release" no longer applies — the commit is now from vLLM PR `#24322` (spec decode with draft models), producing a `0.14.0rc2` dev build. Update or remove the stale comment to avoid confusion.
- Version mismatch: The wheel filename embeds version `0.14.0rc2.dev156`, but `pyproject.toml` requires `vllm>=0.15.1,<0.16`. While the build script unpins vllm in pyproject.toml before installing, this means the custom-build path produces a vLLM version that would not satisfy the standard dependency constraint. This is worth documenting explicitly (e.g., a comment near `VLLM_WHEEL_COMMIT`) so users understand the custom build is intentionally outside the pinned range.
Suggested comment update
```diff
-# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
-# This commit was chosen as one close to the v0.10 release: git merge-base --fork-point origin/main tags/v0.10.0
-VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
+# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
+# This commit produces a 0.14.0rc2 dev wheel, which is intentionally outside the >=0.15.1 range in pyproject.toml.
+# The build script unpins the vllm constraint to allow this local editable install.
+VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
GIT_REF=${2:-4a5299c93ff97c26def537b92562df5ada530fea}
# NOTE: VLLM_USE_PRECOMPILED=1 didn't always seem to work since the wheels were sometimes built against an incompatible torch/cuda combo.
# This commit produces a 0.14.0rc2 dev wheel, which is intentionally outside the >=0.15.1 range in pyproject.toml.
# The build script unpins the vllm constraint to allow this local editable install.
VLLM_WHEEL_COMMIT=${3:-4a5299c93ff97c26def537b92562df5ada530fea} # merge commit of vllm PR #24322 (spec decode with draft models)
export VLLM_PRECOMPILED_WHEEL_LOCATION="${VLLM_PRECOMPILED_WHEEL_LOCATION:-https://wheels.vllm.ai/${VLLM_WHEEL_COMMIT}/vllm-0.14.0rc2.dev156%2Bg4a5299c93-cp38-abi3-manylinux_2_31_x86_64.whl}"
```
🤖 Prompt for AI Agents
In `@tools/build-custom-vllm.sh` around lines 24 - 28, Update the stale note and
add an explicit warning about the version mismatch: replace or remove the old
"close to the v0.10 release" comment near GIT_REF/VLLM_WHEEL_COMMIT and instead
add a short comment next to VLLM_WHEEL_COMMIT (and/or
VLLM_PRECOMPILED_WHEEL_LOCATION) stating this commit/build produces a vllm wheel
tagged 0.14.0rc2.dev156 (from PR `#24322`) which is intentionally outside the
normal pyproject.toml constraint (pyproject.toml requires vllm>=0.15.1,<0.16)
and that the script unpins vllm before install when using the custom wheel;
reference the variables GIT_REF, VLLM_WHEEL_COMMIT and
VLLM_PRECOMPILED_WHEEL_LOCATION to locate where to update the comments.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Documentation
Dependencies