Cursor Bugbot has reviewed your changes and found 4 potential issues.
def sample_random_history_rollout(self, exclude_tasks: set[str] | None = None) -> vf.RolloutOutput | None:
    """Sample one random historical rollout without removing it."""
    exclude_tasks = exclude_tasks or set()
    eligible = [rollout for rollout in self.rollout_history if rollout["task"] not in exclude_tasks]
Task type mismatch in exclude filter
Medium Severity
In sample_random_history_rollout, the filter uses rollout["task"] not in exclude_tasks, but exclude_tasks is built with str(task) in the scheduler. If rollout["task"] is not a string (e.g. an int from JSON or a dataset), the membership test never matches, so rollouts that should be excluded remain eligible for sampling.
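One possible fix is to normalize both sides of the comparison to strings. The sketch below is a standalone illustration, not the PR's code: it takes the history as a plain list of dicts instead of reading self.rollout_history, and the rng parameter is a hypothetical addition for deterministic testing.

```python
from __future__ import annotations

import random
from typing import Any


def sample_random_history_rollout(
    rollout_history: list[dict[str, Any]],
    exclude_tasks: set[str] | None = None,
    rng: random.Random | None = None,  # hypothetical: injectable RNG for tests
) -> dict[str, Any] | None:
    """Sample one historical rollout without removing it from the history."""
    # Normalize the exclusion set to strings, mirroring str(task) in the scheduler.
    exclude = {str(task) for task in (exclude_tasks or set())}
    # Normalize the rollout's task as well, so an int task 1 matches the string "1".
    eligible = [r for r in rollout_history if str(r["task"]) not in exclude]
    if not eligible:
        return None
    return (rng or random).choice(eligible)
```

With this change, a rollout whose task is the int 1 is excluded by exclude_tasks={"1"}, which the original membership test would miss.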
def update(self, rollouts: list[vf.RolloutOutput]):
    """Updates the buffer state with completed rollouts."""
    self.rollout_history.extend(rollouts)
return example

exclude_tasks_raw = existing_info.get("dynamic_prompt_exclude_tasks")
exclude_tasks = {str(task) for task in exclude_tasks_raw} if isinstance(exclude_tasks_raw, list) else set()
exclude_tasks_raw only handles list type
Low Severity
dynamic_prompt_exclude_tasks is only processed when it is a list. If the environment passes a set, tuple, or other iterable, the condition isinstance(exclude_tasks_raw, list) fails and exclude_tasks becomes empty, so the user's exclusion list is ignored.
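A minimal sketch of a more permissive normalization, assuming the value is a JSON-like collection. The helper name normalize_exclude_tasks is hypothetical, and treating a bare string as a single task name is a design assumption, not something the PR specifies.

```python
from __future__ import annotations


def normalize_exclude_tasks(raw: object) -> set[str]:
    """Coerce a user-supplied exclusion value into a set of task-name strings."""
    if raw is None:
        return set()
    # Assumption: a bare string names one task; iterating it would yield characters.
    if isinstance(raw, str):
        return {raw}
    # Accept the common collection types, not just list.
    if isinstance(raw, (list, tuple, set, frozenset)):
        return {str(task) for task in raw}
    # Anything else is ignored; raising an error here is an equally valid choice.
    return set()
```

Listing the accepted types explicitly avoids surprises with other iterables such as dicts or bytes, at the cost of rejecting custom containers.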
source_rollout = self.buffer.sample_random_history_rollout(exclude_tasks=exclude_tasks)
if source_rollout is None:
    return example
Missing self_verification when rollout history empty
High Severity
When dynamic_prompt_mode is self_verification and rollout_history is empty (e.g. at the start of training), the example is returned unchanged with no self_verification key. The env receives an example that signals self_verification mode but lacks the expected info.self_verification structure, which can cause a KeyError when the env accesses it.
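One way to avoid the missing key is to always write info.self_verification in this mode, with source set to None when no rollout is available. The sketch below is an assumption about how that could look, using plain dicts; the function name inject_self_verification is hypothetical, and whether the env prefers a None source or a skipped example is a design decision for the PR author.

```python
from __future__ import annotations

import copy
from typing import Any

# Fields copied into info.self_verification.source, per the PR overview.
SOURCE_FIELDS = ("task", "example_id", "reward", "prompt", "completion")


def inject_self_verification(
    example: dict[str, Any],
    source_rollout: dict[str, Any] | None,
) -> dict[str, Any]:
    """Return a copy of example with info.self_verification always populated."""
    example = copy.deepcopy(example)  # never mutate the caller's example
    info = example.get("info") or {}
    if info.get("dynamic_prompt_mode") != "self_verification":
        return example
    if source_rollout is None:
        # Empty history: keep the key present so the env can test for None.
        source = None
    else:
        source = {field: source_rollout.get(field) for field in SOURCE_FIELDS}
    info["self_verification"] = {"source": source}
    example["info"] = info
    return example
```

The env can then check whether info["self_verification"]["source"] is None instead of guarding against a missing key.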


Save previous rollout history.
Allow sampling from rollout history when dynamic prompt mode is self-verification.
Dynamic prompt mode would also allow envs that generate prompts on the fly, such as logic tasks or pairwise tasks.
Note
Medium Risk
Changes training-time sampling behavior and checkpoint contents; incorrect history sampling/exclusions could bias data or break assumptions about example immutability across rollouts.
Overview
Adds self-verification dynamic prompting by persisting a non-destructive rollout_history in the orchestrator Buffer, saving/loading it in checkpoints, and exposing sample_random_history_rollout() for random historical sampling.
Updates the rollout Scheduler to deepcopy sampled examples and, when example["info"].dynamic_prompt_mode == "self_verification", inject a sampled historical rollout (task/example_id/reward/prompt/completion) into info.self_verification.source, with optional task exclusions to avoid recursive self-verification.
Also includes a tiny formatting-only tweak to an OrchestratorConfig validation error message.
Written by Cursor Bugbot for commit e3f31a6. This will update automatically on new commits.
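The history persistence described in the overview can be sketched roughly as follows. This is an illustration only: the class name HistoryBuffer and the state_dict/load_state_dict methods are hypothetical stand-ins, and the PR's actual Buffer and checkpoint format may differ.

```python
from __future__ import annotations

import json
from typing import Any


class HistoryBuffer:
    """Toy buffer that keeps every completed rollout for later sampling."""

    def __init__(self) -> None:
        self.rollout_history: list[dict[str, Any]] = []

    def update(self, rollouts: list[dict[str, Any]]) -> None:
        # Non-destructive: rollouts are appended and never popped on sampling.
        self.rollout_history.extend(rollouts)

    def state_dict(self) -> dict[str, Any]:
        # Hypothetical checkpoint payload; a shallow copy of the history list.
        return {"rollout_history": list(self.rollout_history)}

    def load_state_dict(self, state: dict[str, Any]) -> None:
        self.rollout_history = list(state.get("rollout_history", []))


# Round-trip through JSON as a stand-in for writing and reading a checkpoint.
buf = HistoryBuffer()
buf.update([{"task": "math", "example_id": 0, "reward": 1.0}])
restored = HistoryBuffer()
restored.load_state_dict(json.loads(json.dumps(buf.state_dict())))
```

Because the history only grows in this sketch, a real implementation would likely need to consider how large the checkpointed history is allowed to become.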