Skip to content

individual rollouts#1865

Open
faresobeid wants to merge 6 commits intomainfrom
individual-rollouts
Open

individual rollouts#1865
faresobeid wants to merge 6 commits intomainfrom
individual-rollouts

Conversation

@faresobeid
Copy link
Contributor

@faresobeid faresobeid commented Feb 24, 2026

Re-write the scheduler to do rollouts at the rollout level instead of group level. This way any long tails within groups are handled, so when a rollout within a group finishes it can move on to another group.
Currently if any env needs group scoring, we go back to previous behaviour of group rollouts although ideally we make it so verifiers is closer here and we can do scoring within prime-rl


Note

High Risk
Refactors the core training scheduler from group-level to per-rollout scheduling and changes when/where scoring happens for group-based rubrics, which can affect training throughput and reward correctness.

Overview
Reworks training rollout scheduling to operate at the individual rollout level rather than run_group, keeping max_inflight_rollouts filled while independently tracking per-example groups and only emitting a group once all rollouts_per_example complete.

Adds deferred group scoring for environments whose rubrics require group-level reward functions: orchestrator disables per-rollout scoring (score_rollouts=False) for those tasks and the scheduler scores the completed group via rubric.score_group after aggregation (with warnings for externally-managed env servers).

Introduces vf_utils.run_rollout() wrapper and updates scheduler metrics to distinguish in-flight rollouts vs. total pending samples, with unit tests covering the new metric behavior.

Written by Cursor Bugbot for commit e9ad626. This will update automatically on new commits. Configure here.

Copy link
Member

@mikasenghaas mikasenghaas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, yea lets do some testing on this to verify that we dont have any async race conditions anywhere but directionally looks good.

i think mid-term we want to move away from the verifiers env group for training envs and make our abstractions "multi-env" aware by default, e.g. smth like having a buffer and scheduler per env goverened a "scheduling" component on top or smth bc i think we will want more and more fine-grained control over how each env behaves (e.g. here whether or not to use gruop scheduling) and its always awkward to code this in an abstraction that handles multiple cases where you need conditional everywhere

@faresobeid faresobeid changed the base branch from tokne-batch to main February 25, 2026 14:43
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@samsja
Copy link
Member

samsja commented Feb 25, 2026

lets wait for @mikasenghaas review before merging

Co-authored-by: faresobeid <faresobeid@users.noreply.github.com>
raise ValueError(
"max_inflight_rollouts conflicts with oversampling_factor * batch_size"
)
raise ValueError("max_inflight_rollouts conflicts with oversampling_factor * batch_size")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo, could just deprectea oversampling_factor, no?

env_names=train_env_names,
map_kwargs=dict(writer_batch_size=1), # set defensively to not error on map operations on large datasets
)
verification_enabled = not config.buffer.skip_verification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait i never knew abt this arg, what is it for?

)
verification_enabled = not config.buffer.skip_verification

def task_uses_group_scoring(task_name: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer to not have this logic in orch, could maybe put into vf_utils?

Comment on lines +93 to +95
self.group_examples: dict[int, dict] = {}
self.group_rollouts_to_schedule: dict[int, int] = {}
self.completed_group_rollouts: dict[int, list[vf.RolloutOutput]] = defaultdict(list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if we need all of these data structures or if some can be merged. e.g. group_rollouts_to_schedule[group_id] seems redundant with group_size - len(completed_group_rollouts[group_id]

off_policy_steps=0, client_config=client_config, group_id=group_id
)

def _inflight_rollout_count(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this public imo, also nice to decorate thes with @property imo

return scheduler


def test_inflight_sample_count_includes_pending_group_rollouts():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure these tests are super useful as is haha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants