Implement self-verification support #1890

Open
faresobeid wants to merge 2 commits into main from self-verify
Conversation

faresobeid (Contributor) commented Feb 25, 2026

Save the history of previous rollouts.
Allow sampling from the rollout history when dynamic prompt mode is set to self-verification.
Dynamic prompt mode would also allow for envs that generate prompts on the fly, such as logic tasks or pairwise tasks.


Note

Medium Risk
Changes training-time sampling behavior and checkpoint contents; incorrect history sampling/exclusions could bias data or break assumptions about example immutability across rollouts.

Overview
Adds self-verification dynamic prompting by persisting a non-destructive rollout_history in the orchestrator Buffer, saving/loading it in checkpoints, and exposing sample_random_history_rollout() for random historical sampling.

Updates the rollout Scheduler to deepcopy sampled examples and, when example["info"].dynamic_prompt_mode == "self_verification", inject a sampled historical rollout (task/example_id/reward/prompt/completion) into info.self_verification.source, with optional task exclusions to avoid recursive self-verification.
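The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function name `inject_self_verification`, its signature, and the use of plain dicts in place of `vf.RolloutOutput` are assumptions made for the example. Only the field names (`dynamic_prompt_mode`, `self_verification`, `source`, and the five copied keys) come from the overview.

```python
import copy
import random
from typing import Any

# Fields the overview says are copied from the historical rollout.
SOURCE_FIELDS = ("task", "example_id", "reward", "prompt", "completion")


def inject_self_verification(
    example: dict[str, Any],
    history: list[dict[str, Any]],
    exclude_tasks: set[str],
    rng: random.Random,
) -> dict[str, Any]:
    """Return a copy of `example` with a sampled historical rollout attached.

    Hypothetical helper: the name and signature are not taken from the PR.
    """
    # Deep-copy first so the dataset's example is never mutated across rollouts.
    example = copy.deepcopy(example)
    info = example.setdefault("info", {})
    if info.get("dynamic_prompt_mode") != "self_verification":
        return example

    # Skip excluded tasks, e.g. to avoid verifying a verification rollout.
    eligible = [r for r in history if r["task"] not in exclude_tasks]
    if not eligible:
        return example

    source = rng.choice(eligible)
    info["self_verification"] = {
        "source": {key: source.get(key) for key in SOURCE_FIELDS}
    }
    return example
```

Because the example is copied before modification, the same dataset example can be scheduled again later and receive a different historical rollout.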

Also includes a tiny formatting-only tweak to an OrchestratorConfig validation error message.

Written by Cursor Bugbot for commit e3f31a6. This will update automatically on new commits.

faresobeid marked this pull request as ready for review February 25, 2026 20:00

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

def sample_random_history_rollout(self, exclude_tasks: set[str] | None = None) -> vf.RolloutOutput | None:
    """Sample one random historical rollout without removing it."""
    exclude_tasks = exclude_tasks or set()
    eligible = [rollout for rollout in self.rollout_history if rollout["task"] not in exclude_tasks]

Task type mismatch in exclude filter

Medium Severity

In sample_random_history_rollout, the filter uses rollout["task"] not in exclude_tasks, but the scheduler builds exclude_tasks with str(task). If rollout["task"] is not a string (e.g. an int from JSON or a dataset), the membership test never matches, so rollouts that should be excluded remain eligible for sampling.
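One possible fix is to normalize both sides of the comparison to strings. The sketch below is a standalone function rather than the Buffer method from the PR; passing `rollout_history` and an optional `rng` as parameters is an assumption made so the example runs on its own.

```python
import random
from typing import Any, Iterable, Optional


def sample_random_history_rollout(
    rollout_history: list[dict[str, Any]],
    exclude_tasks: Optional[Iterable[Any]] = None,
    rng: Optional[random.Random] = None,
) -> Optional[dict[str, Any]]:
    """Sample one historical rollout, comparing task names as strings."""
    # Normalize the exclusion set once, whatever types the caller passed.
    excluded = {str(task) for task in (exclude_tasks or ())}
    # Normalize each rollout's task too, so 1 and "1" are treated as equal.
    eligible = [r for r in rollout_history if str(r["task"]) not in excluded]
    if not eligible:
        return None
    chooser = rng if rng is not None else random
    return chooser.choice(eligible)
```

With this version, a rollout whose task is the integer 1 is excluded by an exclusion set containing "1".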


def update(self, rollouts: list[vf.RolloutOutput]):
    """Updates the buffer state with completed rollouts."""
    self.rollout_history.extend(rollouts)

Unbounded rollout history memory growth

Medium Severity

rollout_history is extended with every batch of completed rollouts in update() but is never trimmed. In long training runs this can lead to unbounded memory growth and OOM, since each rollout stores prompt, completion, and other fields.
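One way to cap memory is to keep the history in a fixed-size FIFO container. The sketch below uses `collections.deque` with `maxlen`; the class name `BoundedRolloutHistory` and the `max_size` parameter are hypothetical and do not appear in the PR.

```python
import random
from collections import deque
from typing import Any, Iterable, Optional


class BoundedRolloutHistory:
    """FIFO rollout history that keeps at most `max_size` entries."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        # A deque with maxlen silently drops the oldest items when full.
        self._rollouts: deque = deque(maxlen=max_size)

    def extend(self, rollouts: Iterable[dict[str, Any]]) -> None:
        """Add completed rollouts, evicting the oldest ones if needed."""
        self._rollouts.extend(rollouts)

    def sample(self, rng: Optional[random.Random] = None) -> Optional[dict[str, Any]]:
        """Return one random rollout without removing it, or None if empty."""
        if not self._rollouts:
            return None
        chooser = rng if rng is not None else random
        return self._rollouts[chooser.randrange(len(self._rollouts))]

    def snapshot(self) -> list[dict[str, Any]]:
        """Return a plain list, e.g. for writing into a checkpoint."""
        return list(self._rollouts)

    def __len__(self) -> int:
        return len(self._rollouts)
```

FIFO eviction biases sampling toward recent rollouts. If uniform coverage of the whole run matters more, reservoir sampling would be an alternative with the same memory bound.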

return example

exclude_tasks_raw = existing_info.get("dynamic_prompt_exclude_tasks")
exclude_tasks = {str(task) for task in exclude_tasks_raw} if isinstance(exclude_tasks_raw, list) else set()

exclude_tasks_raw only handles list type

Low Severity

dynamic_prompt_exclude_tasks is only processed when it is a list. If the environment passes a set, tuple, or other iterable, the condition isinstance(exclude_tasks_raw, list) fails and exclude_tasks becomes empty, so the user's exclusion list is ignored.
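A more permissive normalization could accept any iterable of task names. In the sketch below, the helper name `normalize_exclude_tasks` is hypothetical, and raising `TypeError` on unsupported values is a design choice made here so that a misconfigured exclusion list is not silently dropped.

```python
from collections.abc import Iterable


def normalize_exclude_tasks(raw: object) -> set[str]:
    """Convert a user-supplied exclusion value into a set of task names."""
    if raw is None:
        return set()
    # A bare string is one task name, not an iterable of characters.
    if isinstance(raw, str):
        return {raw}
    # Lists, tuples, sets, and other iterables are all accepted.
    if isinstance(raw, Iterable):
        return {str(task) for task in raw}
    raise TypeError(
        f"dynamic_prompt_exclude_tasks must be iterable, got {type(raw).__name__}"
    )
```

In the scheduler this would replace the `isinstance(exclude_tasks_raw, list)` check, for example `exclude_tasks = normalize_exclude_tasks(existing_info.get("dynamic_prompt_exclude_tasks"))`.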


source_rollout = self.buffer.sample_random_history_rollout(exclude_tasks=exclude_tasks)
if source_rollout is None:
    return example

Missing self_verification when rollout history empty

High Severity

When dynamic_prompt_mode is self_verification and rollout_history is empty (e.g. at the start of training), the example is returned unchanged, with no self_verification key. The env then receives an example that signals self_verification mode but lacks the expected info.self_verification structure, which can raise a KeyError when the env accesses it.
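One way to avoid the missing key is to always attach the `self_verification` structure and use `None` as the source when no history is available. Both helper names below are hypothetical, and whether an env should fall back to a regular prompt or skip the example in that case is a decision the PR does not specify.

```python
from typing import Any, Optional


def attach_self_verification(
    info: dict[str, Any],
    source_rollout: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Return a copy of `info` that always carries a self_verification key.

    `source` is None when no historical rollout could be sampled.
    """
    updated = dict(info)
    updated["self_verification"] = {"source": source_rollout}
    return updated


def has_verification_source(info: dict[str, Any]) -> bool:
    """Env-side check that tolerates a missing or empty structure."""
    return (info.get("self_verification") or {}).get("source") is not None
```

An env can then branch on `has_verification_source(info)` instead of indexing `info["self_verification"]` directly.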

1 participant