implement self-verification support by faresobeid · Pull Request #1890 · PrimeIntellect-ai/prime-rl

faresobeid · 2026-02-25T19:32:38Z

Saves previous rollout history.
Allows sampling from the rollout history when dynamic prompt mode is set to self-verification.
Dynamic prompt mode would also allow for envs that generate prompts on the fly, such as logic tasks or pairwise tasks.

Note

Medium Risk
Changes training-time sampling behavior and checkpoint contents; incorrect history sampling/exclusions could bias data or break assumptions about example immutability across rollouts.

Overview
Adds self-verification dynamic prompting by persisting a non-destructive rollout_history in the orchestrator Buffer, saving/loading it in checkpoints, and exposing sample_random_history_rollout() for random historical sampling.

Updates the rollout Scheduler to deepcopy sampled examples and, when example["info"].dynamic_prompt_mode == "self_verification", inject a sampled historical rollout (task/example_id/reward/prompt/completion) into info.self_verification.source, with optional task exclusions to avoid recursive self-verification.
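The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `inject_source`, the `SOURCE_FIELDS` tuple, and the use of plain dicts for examples and rollouts are assumptions made for the example.

```python
from __future__ import annotations

import copy
from typing import Any

# Fields copied from the historical rollout, as listed in the overview.
SOURCE_FIELDS = ("task", "example_id", "reward", "prompt", "completion")


def inject_source(example: dict[str, Any], source_rollout: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `example` with a historical rollout attached (hypothetical helper)."""
    # Deep-copy first so the original dataset row is never mutated.
    example = copy.deepcopy(example)
    info = example.get("info") or {}
    if info.get("dynamic_prompt_mode") != "self_verification":
        return example
    # Copy only the listed fields; anything else on the rollout is dropped.
    source = {key: source_rollout.get(key) for key in SOURCE_FIELDS}
    info.setdefault("self_verification", {})["source"] = source
    example["info"] = info
    return example
```

Copying before mutating is what keeps examples immutable across rollouts, which is the assumption the risk note above is concerned with.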

Also includes a tiny formatting-only tweak to an OrchestratorConfig validation error message.

Written by Cursor Bugbot for commit e3f31a6. This will update automatically on new commits.

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.


cursor · 2026-02-25T20:07:46Z

src/prime_rl/orchestrator/buffer.py

+    def sample_random_history_rollout(self, exclude_tasks: set[str] | None = None) -> vf.RolloutOutput | None:
+        """Sample one random historical rollout without removing it."""
+        exclude_tasks = exclude_tasks or set()
+        eligible = [rollout for rollout in self.rollout_history if rollout["task"] not in exclude_tasks]


Task type mismatch in exclude filter

Medium Severity

In sample_random_history_rollout, the filter uses rollout["task"] not in exclude_tasks, but exclude_tasks is built with str(task) in the scheduler. If rollout["task"] is not a string (e.g. an int from JSON or a dataset), the membership test never matches and rollouts that should be excluded remain eligible for sampling.
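One possible way to address this is to normalise both sides of the comparison with str(). The sketch below is a standalone function rather than the Buffer method, and it takes the history as an explicit argument and uses plain dicts in place of vf.RolloutOutput; both are assumptions made so the example runs on its own.

```python
from __future__ import annotations

import random
from typing import Any


def sample_random_history_rollout(
    rollout_history: list[dict[str, Any]],
    exclude_tasks: set[str] | None = None,
) -> dict[str, Any] | None:
    """Sample one historical rollout, comparing task names as strings."""
    exclude = {str(task) for task in (exclude_tasks or set())}
    # str() on the stored task makes 7 and "7" compare equal.
    eligible = [r for r in rollout_history if str(r["task"]) not in exclude]
    return random.choice(eligible) if eligible else None
```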

cursor · 2026-02-25T20:07:46Z

src/prime_rl/orchestrator/buffer.py


    def update(self, rollouts: list[vf.RolloutOutput]):
        """Updates the buffer state with completed rollouts."""
+        self.rollout_history.extend(rollouts)


Unbounded rollout history memory growth

Medium Severity

rollout_history is extended with every batch of completed rollouts in update() but is never trimmed. In long training runs this can lead to unbounded memory growth and OOM, since each rollout stores prompt, completion, and other fields.
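A common way to bound this kind of history is a fixed-size deque that discards the oldest entries. The sketch below assumes a hypothetical `BoundedHistory` class and a `max_size` setting, neither of which appears in the PR; it only illustrates the trimming behaviour.

```python
from __future__ import annotations

from collections import deque
from typing import Any


class BoundedHistory:
    """Keeps only the most recent `max_size` rollouts (hypothetical helper)."""

    def __init__(self, max_size: int) -> None:
        # A deque with maxlen drops items from the left once it is full.
        self.rollout_history: deque[dict[str, Any]] = deque(maxlen=max_size)

    def update(self, rollouts: list[dict[str, Any]]) -> None:
        """Append completed rollouts, evicting the oldest ones if needed."""
        self.rollout_history.extend(rollouts)
```

A bounded history trades memory safety for a sampling distribution that favours recent rollouts, so whether this is acceptable depends on how self-verification is meant to use older samples.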

cursor · 2026-02-25T20:07:46Z

src/prime_rl/orchestrator/scheduler.py

+            return example
+
+        exclude_tasks_raw = existing_info.get("dynamic_prompt_exclude_tasks")
+        exclude_tasks = {str(task) for task in exclude_tasks_raw} if isinstance(exclude_tasks_raw, list) else set()


exclude_tasks_raw only handles list type

Low Severity

dynamic_prompt_exclude_tasks is only processed when it is a list. If the environment passes a set, tuple, or other iterable, the condition isinstance(exclude_tasks_raw, list) fails and exclude_tasks becomes empty, so the user's exclusion list is ignored.
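A more permissive parse could accept any iterable while still treating a bare string as a single task name. The helper name `normalize_exclude_tasks` below is hypothetical and not part of the PR.

```python
from __future__ import annotations

from collections.abc import Iterable
from typing import Any


def normalize_exclude_tasks(raw: Any) -> set[str]:
    """Turn a list, tuple, set, or single task name into a set of strings."""
    if raw is None:
        return set()
    # A string is iterable too; treat it as one task, not as characters.
    if isinstance(raw, str):
        return {raw}
    if isinstance(raw, Iterable):
        return {str(task) for task in raw}
    # Fall back to a single non-iterable value such as an int task id.
    return {str(raw)}
```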

cursor · 2026-02-25T20:07:47Z

src/prime_rl/orchestrator/scheduler.py

+
+        source_rollout = self.buffer.sample_random_history_rollout(exclude_tasks=exclude_tasks)
+        if source_rollout is None:
+            return example


Missing self_verification when rollout history empty

High Severity

When dynamic_prompt_mode is self_verification and rollout_history is empty (e.g. at training start), the example is returned unchanged with no self_verification key. The env receives an example that signals self_verification mode but lacks the expected info.self_verification structure, which can cause a KeyError when the env accesses it.
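One option is to always write the key and use None as an explicit "no source yet" marker, so the env can branch on it instead of hitting a missing key. The sketch below uses a hypothetical `attach_source` helper and plain dicts; whether the env should skip, fall back, or retry when the source is None is a design decision this example does not settle.

```python
from __future__ import annotations

import copy
from typing import Any


def attach_source(
    example: dict[str, Any],
    source_rollout: dict[str, Any] | None,
) -> dict[str, Any]:
    """Always set info["self_verification"], even with no source available."""
    example = copy.deepcopy(example)
    info = example.setdefault("info", {})
    # source is None when the history is empty, so the key is always present.
    info["self_verification"] = {"source": source_rollout}
    return example
```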

faresobeid added 2 commits February 25, 2026 19:31

implement self-verification support

93dd1e7

ruff

e3f31a6

faresobeid marked this pull request as ready for review February 25, 2026 20:00

cursor bot reviewed Feb 25, 2026

faresobeid wants to merge 2 commits into main from self-verify
