Feat: per-message tokenization for prefix cache hits#1863
Conversation
In multi-turn chat, vLLM re-tokenizes the full conversation from scratch each turn. BPE encoding can produce different token IDs than what AR decoding originally generated (non-canonical tokenization), silently breaking prefix cache hits. Two patches:

1. Cache AR-generated token IDs after each completion.
2. Splice the cached tokens back in when the same assistant content appears in subsequent turns, tokenizing each message segment independently.

Enabled via `per_message_tokenization = true` in the inference config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
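The two patches described above can be sketched roughly as follows. This is a minimal illustration of the cache-then-splice idea, not the actual vLLM hooks — `record_completion`, `render_prompt`, and `ar_token_cache` are hypothetical names:

```python
# Patch 1 (sketch): after each non-streaming completion, remember the exact
# AR-generated token IDs, keyed by the decoded text.
ar_token_cache: dict[str, list[int]] = {}

def record_completion(text: str, token_ids: list[int]) -> None:
    ar_token_cache[text] = token_ids

# Patch 2 (sketch): when re-rendering a multi-turn prompt, tokenize each
# message segment independently and splice cached AR tokens back in for
# assistant content seen before, instead of BPE-encoding the whole string.
def render_prompt(segments: list[str], encode) -> list[int]:
    token_ids: list[int] = []
    for seg in segments:
        token_ids.extend(ar_token_cache.get(seg) or encode(seg))
    return token_ids
```

Because cached segments reproduce the original AR token IDs byte-for-byte, earlier turns re-render to an identical prefix, which is what makes prefix-cache hits possible.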
Runs wordle RL with TITO disabled to compare `samples_per_rollout` between baseline (standard tokenization) and per-message tokenization (AR token caching). `samples_per_rollout > 1` indicates that a tokenization mismatch broke the extension property across turns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove invalid `ckpt.enabled` field
- Fix boolean CLI flag syntax (`--no-inference.per-message-tokenization`)
- Adjust GPU layout and `seq_len` for RTX 3090s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
        "Improves prefix cache hit rates in multi-turn conversations by ensuring re-tokenization of "
        "previous turns always produces the same token IDs.",
    ),
] = False
```
Missing CHANGELOG entry for new config field
Low Severity
A new config field `per_message_tokenization` is added to src/prime_rl/inference/config.py, but CHANGELOG.md is not updated. Per project rules, any PR that modifies configuration structures in src/prime_rl/*/config.py (fields added, removed, renamed, or moved, or default values changed) must include a corresponding CHANGELOG.md entry.
Triggered by project rule: BugBot Instructions
```python
if config.per_message_tokenization:
    from prime_rl.inference.patches import monkey_patch_per_message_tokenization

    monkey_patch_per_message_tokenization()
```
Patch not applied in multi-API-server worker processes
High Severity
`monkey_patch_per_message_tokenization()` is called only in `main()`, not at module level in vllm/server.py like the other four patches. When `api_server_count > 1`, spawned worker processes re-import the module to pick up module-level patches but never call this one. The benchmark config itself sets `dp = 2`, which auto-bumps `api_server_count` to 2, so the feature silently does nothing in the very scenario it is designed for: all requests are handled by unpatched worker processes.


Summary
When serving models via vLLM's OpenAI-compatible chat API in multi-turn workflows, prefix caching breaks silently because BPE re-tokenization of assistant responses produces different token IDs than the original AR decoding (e.g. `"\n" + "\n"` → `[198, 198]` during AR vs `"\n\n"` → `[271]` during re-encode).

This PR adds an opt-in monkey patch (`per_message_tokenization = true`) that ensures prefix cache hits across turns and preserves the extension property for RL training (all turns merge into a single sample instead of splitting).
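The newline example above can be reproduced with a toy longest-match tokenizer. This is not a real BPE vocabulary — the two-entry `VOCAB` and `greedy_encode` are illustrative stand-ins for the merge behavior that causes the mismatch:

```python
# Toy vocab: "\n" and the merged "\n\n" piece, with IDs matching the example.
VOCAB = {"\n": 198, "\n\n": 271}

def greedy_encode(text: str) -> list[int]:
    """Longest-match tokenization over the toy vocab (stands in for BPE merges)."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"untokenizable text at offset {i}")
    return ids

# AR decoding emitted "\n" twice, one token at a time: [198, 198].
ar_tokens = [VOCAB["\n"], VOCAB["\n"]]
# Re-encoding the concatenated text applies the merge and yields [271].
reencoded = greedy_encode("\n" + "\n")
assert ar_tokens == [198, 198] and reencoded == [271]
```

Both token sequences decode to the same text, but they differ as ID sequences, so the prefix cache sees a divergent prompt on the next turn.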
Full design, usage, and known limitations are documented in docs/per-message-tokenization.md.
Changes
- `src/prime_rl/inference/patches.py` — two monkey patches (cache population + per-message tokenization)
- `src/prime_rl/inference/config.py` — `per_message_tokenization` flag
- `src/prime_rl/inference/server.py` — conditional patch application
- `docs/per-message-tokenization.md` — problem description, fix, limitations
- `configs/benchmark/` + `scripts/` — benchmark config & script (wordle, TITO disabled)

Usage
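A minimal usage sketch, assuming a TOML inference config (the section name is illustrative; only the flag itself is confirmed by this PR):

```toml
# Opt in to per-message tokenization (off by default).
[inference]
per_message_tokenization = true
```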
Test plan
- `per_message_tokenization = false` (baseline) — expect `samples_per_rollout > 1`
- `per_message_tokenization = true` — expect `samples_per_rollout = 1`

🤖 Generated with Claude Code
Note
Medium Risk
Introduces a vLLM monkey patch that changes how chat prompts are tokenized and cached, which can affect correctness and performance in multi-turn serving; it relies on heuristic-based matching and adds new in-process memory usage.
Overview
Adds an opt-in `per_message_tokenization` inference flag that monkey-patches vLLM to cache AR-generated `output.text -> token_ids` mappings (non-streaming) and reuse those tokens when re-rendering multi-turn chat prompts, preventing BPE merges across message boundaries and improving prefix-cache hit rates.

The inference server conditionally applies this patch at startup, and the PR adds a benchmark config + script to compare baseline vs patched behavior, plus a new doc describing the approach and known limitations.
Written by Cursor Bugbot for commit 9457eca.