
Feat: per-message tokenization for prefix cache hits #1863

Open
samsja wants to merge 3 commits into main from feature/per-message-tokenization

Conversation


@samsja samsja commented Feb 24, 2026

Summary

When serving models via vLLM's OpenAI-compatible chat API in multi-turn workflows, prefix caching breaks silently because BPE re-tokenization of assistant responses produces different token IDs than the original AR decoding (e.g. "\n" + "\n" → [198, 198] during AR vs "\n\n" → [271] during re-encode).
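The mismatch can be reproduced with a toy greedy tokenizer. This is an illustrative sketch, not vLLM's actual tokenizer, but it shows the same non-canonical behavior: two separately generated "\n" tokens are decoded into text, and re-encoding that text yields the single merged token, so the prefix no longer matches.

```python
# Toy illustration (not vLLM's real tokenizer): a vocabulary with a merged
# "\n\n" token re-encodes two separately generated "\n" tokens as one,
# so encode(decode(ids)) != ids and the cached prefix misses.
VOCAB = {198: "\n", 271: "\n\n"}
# Greedy longest-match preference, mimicking how BPE prefers merged tokens.
PIECES = sorted(VOCAB.items(), key=lambda kv: len(kv[1]), reverse=True)

def decode(ids):
    return "".join(VOCAB[i] for i in ids)

def encode(text):
    ids, i = [], 0
    while i < len(text):
        for tok_id, piece in PIECES:
            if text.startswith(piece, i):
                ids.append(tok_id)
                i += len(piece)
                break
    return ids

ar_ids = [198, 198]       # emitted token-by-token during AR decoding
text = decode(ar_ids)     # "\n\n" as rendered back into the next prompt
reencoded = encode(text)  # [271] -- a different token sequence, cache miss
```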

This PR adds an opt-in monkey patch (per_message_tokenization = true) that:

  • Caches AR-generated token IDs after each non-streaming completion
  • Splices cached tokens back in during subsequent turns, so re-tokenization always matches the original generation
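The two patches can be sketched roughly as follows. Function and cache names here are hypothetical, not the PR's actual API; the real implementation lives in src/prime_rl/inference/patches.py.

```python
# Hypothetical sketch of the two patches. Patch 1 records AR-generated
# token IDs keyed by the decoded assistant text; patch 2 tokenizes each
# message independently and splices cached IDs back in instead of
# re-encoding that text on later turns.
from typing import Callable, Dict, List

AR_CACHE: Dict[str, List[int]] = {}

def record_completion(text: str, token_ids: List[int]) -> None:
    """Patch 1: cache token IDs after a non-streaming completion."""
    AR_CACHE[text] = token_ids

def render_prompt(messages: List[dict],
                  encode: Callable[[str], List[int]]) -> List[int]:
    """Patch 2: reuse cached IDs for assistant messages seen before."""
    ids: List[int] = []
    for msg in messages:
        cached = AR_CACHE.get(msg["content"]) if msg["role"] == "assistant" else None
        ids.extend(cached if cached is not None else encode(msg["content"]))
    return ids
```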

This ensures prefix cache hits across turns and preserves the extension property for RL training (all turns merge into a single sample instead of splitting).
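The extension property can be stated as a simple invariant: the token IDs of the full multi-turn prompt must begin with the concatenation of each earlier turn's IDs, so the trainer can keep all turns in one growing sample. An illustrative check (the function name is hypothetical):

```python
# The extension property: full_ids must start with the concatenation of
# the per-turn ID lists; if re-tokenization diverges, the check fails and
# the rollout splits into multiple training samples.
from typing import List

def has_extension_property(turn_ids: List[List[int]], full_ids: List[int]) -> bool:
    flat = [t for ids in turn_ids for t in ids]
    return full_ids[: len(flat)] == flat
```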

Full design, usage, and known limitations are documented in docs/per-message-tokenization.md.

Changes

  • src/prime_rl/inference/patches.py — two monkey patches (cache population + per-message tokenization)
  • src/prime_rl/inference/config.py — per_message_tokenization flag
  • src/prime_rl/inference/server.py — conditional patch application
  • docs/per-message-tokenization.md — problem description, fix, limitations
  • configs/benchmark/ + scripts/ — benchmark config & script (wordle, TITO disabled)

Usage

[inference]
per_message_tokenization = true

Test plan

  • Run wordle benchmark with per_message_tokenization = false (baseline) — expect samples_per_rollout > 1
  • Run wordle benchmark with per_message_tokenization = true — expect samples_per_rollout = 1
  • Verify no regression on single-turn workloads

🤖 Generated with Claude Code


Note

Medium Risk
Introduces a vLLM monkey-patch that changes how chat prompts are tokenized and cached, which can affect correctness/performance in multi-turn serving and has heuristic-based matching and new in-process memory usage.

Overview
Adds an opt-in per_message_tokenization inference flag that monkey-patches vLLM to cache AR-generated output.text -> token_ids (non-streaming) and reuse those tokens when re-rendering multi-turn chat prompts, preventing BPE merges across message boundaries and improving prefix-cache hit rates.

The inference server conditionally applies this patch at startup, and the PR adds a benchmark config + script to compare baseline vs patched behavior plus a new doc describing the approach and known limitations.

Written by Cursor Bugbot for commit 9457eca. This will update automatically on new commits.

samsja and others added 3 commits February 24, 2026 02:15
…rn chat

vLLM re-tokenizes the full conversation from scratch each turn. BPE encoding
can produce different token IDs than what AR decoding originally generated
(non-canonical tokenization), breaking prefix cache hits silently.

Two patches:
1. Cache AR-generated token IDs after each completion
2. Splice cached tokens back in when the same assistant content appears in
   subsequent turns, tokenizing each message segment independently

Enabled via `per_message_tokenization = true` in the inference config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Runs wordle RL with TITO disabled to compare samples_per_rollout between
baseline (standard tokenization) and per-message tokenization (AR token
caching). samples_per_rollout > 1 indicates tokenization mismatch broke
the extension property across turns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove invalid ckpt.enabled field
- Fix boolean CLI flag syntax (--no-inference.per-message-tokenization)
- Adjust GPU layout and seq_len for RTX 3090s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


"Improves prefix cache hit rates in multi-turn conversations by ensuring re-tokenization of "
"previous turns always produces the same token IDs.",
),
] = False

Missing CHANGELOG entry for new config field

Low Severity

A new config field per_message_tokenization is added to src/prime_rl/inference/config.py but CHANGELOG.md is not updated. Per project rules, any PR that modifies configuration structures (added, removed, renamed, moved, or default value changes) in src/prime_rl/*/config.py must include a corresponding CHANGELOG.md entry.


Triggered by project rule: BugBot Instructions

if config.per_message_tokenization:
from prime_rl.inference.patches import monkey_patch_per_message_tokenization

monkey_patch_per_message_tokenization()

Patch not applied in multi-API-server worker processes

High Severity

monkey_patch_per_message_tokenization() is called only in main(), not at module level in vllm/server.py like the other four patches. When api_server_count > 1, spawned worker processes re-import the module to pick up module-level patches but never call this one. The benchmark config itself sets dp = 2, which auto-bumps api_server_count to 2, so the feature silently does nothing in the very scenario it's designed for — all requests are handled by unpatched worker processes.



