
Feat: per-message tokenization for prefix cache hits #1863

Open
samsja wants to merge 3 commits into main from feature/per-message-tokenization

Conversation


@samsja samsja commented Feb 24, 2026

Summary

When serving models via vLLM's OpenAI-compatible chat API in multi-turn workflows, prefix caching breaks silently because BPE re-tokenization of assistant responses produces different token IDs than the original AR decoding (e.g. "\n" + "\n" → [198, 198] during AR vs "\n\n" → [271] during re-encode).
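The mismatch can be reproduced with a toy greedy tokenizer. This is an illustrative sketch, not vLLM's actual tokenizer, but it shows the same non-canonical behavior: two separately generated "\n" tokens are decoded into text, and re-encoding that text yields the single merged token, so the prefix no longer matches.

```python
# Toy illustration (not vLLM's real tokenizer): a vocabulary with a merged
# "\n\n" token re-encodes two separately generated "\n" tokens as one,
# so encode(decode(ids)) != ids and the cached prefix misses.
VOCAB = {198: "\n", 271: "\n\n"}
# Greedy longest-match preference, mimicking how BPE prefers merged tokens.
PIECES = sorted(VOCAB.items(), key=lambda kv: len(kv[1]), reverse=True)

def decode(ids):
    return "".join(VOCAB[i] for i in ids)

def encode(text):
    ids, i = [], 0
    while i < len(text):
        for tok_id, piece in PIECES:
            if text.startswith(piece, i):
                ids.append(tok_id)
                i += len(piece)
                break
    return ids

ar_ids = [198, 198]       # emitted token-by-token during AR decoding
text = decode(ar_ids)     # "\n\n" as rendered back into the next prompt
reencoded = encode(text)  # [271] -- a different token sequence, cache miss
```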

This PR adds an opt-in monkey patch (per_message_tokenization = true) that:

  • Caches AR-generated token IDs after each non-streaming completion
  • Splices cached tokens back in during subsequent turns, so re-tokenization always matches the original generation
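The two patches can be sketched roughly as follows. Function and cache names here are hypothetical, not the PR's actual API; the real implementation lives in src/prime_rl/inference/patches.py.

```python
# Hypothetical sketch of the two patches. Patch 1 records AR-generated
# token IDs keyed by the decoded assistant text; patch 2 tokenizes each
# message independently and splices cached IDs back in instead of
# re-encoding that text on later turns.
from typing import Callable, Dict, List

AR_CACHE: Dict[str, List[int]] = {}

def record_completion(text: str, token_ids: List[int]) -> None:
    """Patch 1: cache token IDs after a non-streaming completion."""
    AR_CACHE[text] = token_ids

def render_prompt(messages: List[dict],
                  encode: Callable[[str], List[int]]) -> List[int]:
    """Patch 2: reuse cached IDs for assistant messages seen before."""
    ids: List[int] = []
    for msg in messages:
        cached = AR_CACHE.get(msg["content"]) if msg["role"] == "assistant" else None
        ids.extend(cached if cached is not None else encode(msg["content"]))
    return ids
```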

This ensures prefix cache hits across turns and preserves the extension property for RL training (all turns merge into a single sample instead of splitting).
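The extension property can be stated as a simple invariant: the token IDs of the full multi-turn prompt must begin with the concatenation of each earlier turn's IDs, so the trainer can keep all turns in one growing sample. An illustrative check (the function name is hypothetical):

```python
# The extension property: full_ids must start with the concatenation of
# the per-turn ID lists; if re-tokenization diverges, the check fails and
# the rollout splits into multiple training samples.
from typing import List

def has_extension_property(turn_ids: List[List[int]], full_ids: List[int]) -> bool:
    flat = [t for ids in turn_ids for t in ids]
    return full_ids[: len(flat)] == flat
```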

Full design, usage, and known limitations are documented in docs/per-message-tokenization.md.

Changes

  • src/prime_rl/inference/patches.py — two monkey patches (cache population + per-message tokenization)
  • src/prime_rl/inference/config.py — per_message_tokenization flag
  • src/prime_rl/inference/server.py — conditional patch application
  • docs/per-message-tokenization.md — problem description, fix, limitations
  • configs/benchmark/ + scripts/ — benchmark config & script (wordle, TITO disabled)

Usage

[inference]
per_message_tokenization = true

Test plan

  • Run wordle benchmark with per_message_tokenization = false (baseline) — expect samples_per_rollout > 1
  • Run wordle benchmark with per_message_tokenization = true — expect samples_per_rollout = 1
  • Verify no regression on single-turn workloads

🤖 Generated with Claude Code


Note

Medium Risk
Introduces a vLLM monkey-patch that changes how chat prompts are tokenized and cached, which can affect correctness/performance in multi-turn serving and has heuristic-based matching and new in-process memory usage.

Overview
Adds an opt-in per_message_tokenization inference flag that monkey-patches vLLM to cache AR-generated output.text -> token_ids (non-streaming) and reuse those tokens when re-rendering multi-turn chat prompts, preventing BPE merges across message boundaries and improving prefix-cache hit rates.

The inference server conditionally applies this patch at startup, and the PR adds a benchmark config + script to compare baseline vs patched behavior plus a new doc describing the approach and known limitations.

Written by Cursor Bugbot for commit 9457eca. This will update automatically on new commits.

samsja and others added 3 commits February 24, 2026 02:15
…rn chat

vLLM re-tokenizes the full conversation from scratch each turn. BPE encoding
can produce different token IDs than what AR decoding originally generated
(non-canonical tokenization), breaking prefix cache hits silently.

Two patches:
1. Cache AR-generated token IDs after each completion
2. Splice cached tokens back in when the same assistant content appears in
   subsequent turns, tokenizing each message segment independently

Enabled via `per_message_tokenization = true` in the inference config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Runs wordle RL with TITO disabled to compare samples_per_rollout between
baseline (standard tokenization) and per-message tokenization (AR token
caching). samples_per_rollout > 1 indicates tokenization mismatch broke
the extension property across turns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove invalid ckpt.enabled field
- Fix boolean CLI flag syntax (--no-inference.per-message-tokenization)
- Adjust GPU layout and seq_len for RTX 3090s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


"Improves prefix cache hit rates in multi-turn conversations by ensuring re-tokenization of "
"previous turns always produces the same token IDs.",
),
] = False

Missing CHANGELOG entry for new config field

Low Severity

A new config field per_message_tokenization is added to src/prime_rl/inference/config.py but CHANGELOG.md is not updated. Per project rules, any PR that modifies configuration structures (added, removed, renamed, moved, or default value changes) in src/prime_rl/*/config.py must include a corresponding CHANGELOG.md entry.


Triggered by project rule: BugBot Instructions

if config.per_message_tokenization:
from prime_rl.inference.patches import monkey_patch_per_message_tokenization

monkey_patch_per_message_tokenization()

Patch not applied in multi-API-server worker processes

High Severity

monkey_patch_per_message_tokenization() is called only in main(), not at module level in vllm/server.py like the other four patches. When api_server_count > 1, spawned worker processes re-import the module to pick up module-level patches but never call this one. The benchmark config itself sets dp = 2, which auto-bumps api_server_count to 2, so the feature silently does nothing in the very scenario it's designed for — all requests are handled by unpatched worker processes.



