Conversation
mikasenghaas
left a comment
nice, yea let's do some testing on this to verify that we don't have any async race conditions anywhere, but directionally it looks good.
i think mid-term we want to move away from the verifiers env group for training envs and make our abstractions "multi-env" aware by default, e.g. something like having a buffer and scheduler per env governed by a "scheduling" component on top, because i think we will want more and more fine-grained control over how each env behaves (e.g. here, whether or not to use group scheduling), and it's always awkward to code this in an abstraction that handles multiple cases where you need conditionals everywhere
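The per-env idea above could be sketched roughly like this. All names here (EnvState, MultiEnvScheduler, submit) are hypothetical illustrations, not existing prime-rl abstractions: each env owns its own buffer and scheduling knobs, and a top-level component routes work, so per-env behaviour becomes configuration instead of conditionals.

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    # Hypothetical per-env state: each env owns its buffer and its own
    # scheduling knobs (e.g. whether to use group scheduling).
    name: str
    use_group_scheduling: bool = False
    buffer: list = field(default_factory=list)

class MultiEnvScheduler:
    # Hypothetical top-level "scheduling" component governing per-env
    # buffers/schedulers, so env-specific behaviour is data, not branches.
    def __init__(self, envs: list[EnvState]):
        self.envs = {env.name: env for env in envs}

    def submit(self, env_name: str, rollout) -> None:
        self.envs[env_name].buffer.append(rollout)

    def group_scheduled_envs(self) -> list[str]:
        return [n for n, e in self.envs.items() if e.use_group_scheduling]
```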
Force-pushed from b33268a to f788c40
Cursor Bugbot has reviewed your changes and found 1 potential issue.
let's wait for @mikasenghaas review before merging
Co-authored-by: faresobeid <faresobeid@users.noreply.github.com>
raise ValueError(
    "max_inflight_rollouts conflicts with oversampling_factor * batch_size"
)

Suggested change:

raise ValueError("max_inflight_rollouts conflicts with oversampling_factor * batch_size")
imo, could just deprecate oversampling_factor, no?
env_names=train_env_names,
map_kwargs=dict(writer_batch_size=1),  # set defensively to not error on map operations on large datasets
)
verification_enabled = not config.buffer.skip_verification
wait, i never knew about this arg, what is it for?
)
verification_enabled = not config.buffer.skip_verification

def task_uses_group_scoring(task_name: str) -> bool:
would prefer to not have this logic in orch, could maybe put into vf_utils?
self.group_examples: dict[int, dict] = {}
self.group_rollouts_to_schedule: dict[int, int] = {}
self.completed_group_rollouts: dict[int, list[vf.RolloutOutput]] = defaultdict(list)
wonder if we need all of these data structures or if some can be merged, e.g. group_rollouts_to_schedule[group_id] seems redundant with group_size - len(completed_group_rollouts[group_id])
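A sketch of the suggested merge, reusing the attribute names from the diff above. Note the derived count only matches the stored one if in-flight (scheduled but unfinished) rollouts are tracked separately and subtracted by the caller, which may be why the extra dict exists:

```python
from collections import defaultdict

class GroupTracker:
    # Sketch only: derive the remaining-rollout count instead of storing
    # it in a second dict that must be kept in sync.
    def __init__(self, group_size: int):
        self.group_size = group_size
        self.group_examples: dict[int, dict] = {}
        self.completed_group_rollouts: dict[int, list] = defaultdict(list)

    def rollouts_to_schedule(self, group_id: int) -> int:
        # Replaces group_rollouts_to_schedule[group_id]; assumes in-flight
        # rollouts are accounted for elsewhere.
        return self.group_size - len(self.completed_group_rollouts[group_id])

    def is_complete(self, group_id: int) -> bool:
        return len(self.completed_group_rollouts[group_id]) >= self.group_size
```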
    off_policy_steps=0, client_config=client_config, group_id=group_id
)

def _inflight_rollout_count(self) -> int:
can make this public imo, also nice to decorate these with @property
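For illustration, the suggested public @property form might look like this (the attribute and property names are hypothetical renames, not the actual code):

```python
class Scheduler:
    def __init__(self) -> None:
        self._inflight_rollouts: dict[int, object] = {}

    @property
    def inflight_rollout_count(self) -> int:
        # Public read-only property instead of a private
        # _inflight_rollout_count() method.
        return len(self._inflight_rollouts)
```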
return scheduler

def test_inflight_sample_count_includes_pending_group_rollouts():
not sure these tests are super useful as is haha
Re-write the scheduler to do rollouts at the rollout level instead of the group level. This way any long tails within groups are handled: when a rollout within a group finishes, capacity can move on to another group.
Currently, if any env needs group scoring, we fall back to the previous behaviour of group rollouts, although ideally verifiers gets closer here and we can do scoring within prime-rl.
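A minimal asyncio sketch of the rollout-level scheduling described above, with placeholder names throughout (run_rollout here is a stand-in coroutine, not the vf_utils wrapper): the in-flight set is kept filled per rollout, and a group is emitted only once all of its rollouts complete.

```python
import asyncio
from collections import defaultdict

async def run_rollout(group_id: int, idx: int) -> tuple[int, int]:
    # Stand-in for a real rollout; a slow rollout in one group no longer
    # blocks capacity from being used by other groups.
    await asyncio.sleep(0.001 * idx)
    return group_id, idx

async def schedule(num_groups: int, rollouts_per_example: int, max_inflight: int) -> list[int]:
    pending = [(g, i) for g in range(num_groups) for i in range(rollouts_per_example)]
    completed: dict[int, list[int]] = defaultdict(list)
    emitted: list[int] = []
    inflight: set[asyncio.Task] = set()
    while pending or inflight:
        # Keep max_inflight filled at the rollout level, not the group level.
        while pending and len(inflight) < max_inflight:
            g, i = pending.pop()
            inflight.add(asyncio.create_task(run_rollout(g, i)))
        done, inflight = await asyncio.wait(inflight, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            g, idx = task.result()
            completed[g].append(idx)
            # A group is emitted only once all of its rollouts complete.
            if len(completed[g]) == rollouts_per_example:
                emitted.append(g)
    return emitted
```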
Note
High Risk
Refactors the core training scheduler from group-level to per-rollout scheduling and changes when/where scoring happens for group-based rubrics, which can affect training throughput and reward correctness.
Overview
Reworks training rollout scheduling to operate at the individual rollout level rather than run_group, keeping max_inflight_rollouts filled while independently tracking per-example groups and only emitting a group once all rollouts_per_example complete.
Adds deferred group scoring for environments whose rubrics require group-level reward functions: the orchestrator disables per-rollout scoring (score_rollouts=False) for those tasks and the scheduler scores the completed group via rubric.score_group after aggregation (with warnings for externally-managed env servers).
Introduces a vf_utils.run_rollout() wrapper and updates scheduler metrics to distinguish in-flight rollouts vs. total pending samples, with unit tests covering the new metric behavior.
Written by Cursor Bugbot for commit e9ad626. This will update automatically on new commits.
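The deferred group-scoring split can be illustrated with a small sketch. The function and parameter names are hypothetical stand-ins for the orchestrator/rubric wiring, chosen to mirror the two paths described above:

```python
from typing import Callable, Sequence

def score_when_group_complete(
    rollouts: Sequence,
    uses_group_scoring: bool,
    score_rollout: Callable,
    score_group: Callable,
) -> list:
    # Hypothetical sketch: for tasks whose rubric needs the whole group,
    # per-rollout scoring is skipped (analogous to score_rollouts=False)
    # and the aggregated group is scored once, analogous to calling
    # rubric.score_group after all rollouts complete.
    if uses_group_scoring:
        return score_group(rollouts)
    return [score_rollout(r) for r in rollouts]
```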