Skip to content

Fix weight update race condition between trainer cleanup and orchestrator #1857

Open
cdreetz wants to merge 4 commits intomainfrom
fix/weight-update-race-condition
Open

Fix weight update race condition between trainer cleanup and orchestrator #1857
cdreetz wants to merge 4 commits intomainfrom
fix/weight-update-race-condition

Conversation

@cdreetz
Copy link

@cdreetz cdreetz commented Feb 23, 2026

current bug in prime-rl when having excessive event loop lag, it will cause the cleanup of old broadcast/step_n that wont happen immediately, then inference will check step_n and see its there, then the delete takes place and the following load fails because step_n is missing by the time it tries to load

The trainer's maybe_clean deletes broadcast checkpoint directories on a timer, but the orchestrator may not have loaded them yet (its event loop can be blocked for 70+ seconds during rollout generation). This caused FileNotFoundError on the inference server and crashed the run.

Fix:

  • Orchestrator writes a LOADED_STEP marker after each successful weight update so the trainer knows which checkpoints have been consumed.
  • Trainer only cleans broadcast directories the orchestrator has moved past (candidate_step < loaded_step).
  • Orchestrator catches update_weights failures and retries on the next poll as defense-in-depth.
  • Fix DefaultModelLoader import for vllm >=0.16 (module renamed from loader to default_loader).

Note

Medium Risk
Changes cross-process coordination and filesystem cleanup logic; mistakes could cause disk bloat (under-cleaning) or renewed missing-checkpoint failures (over-cleaning) under laggy conditions.

Overview
Prevents a race where the trainer deletes broadcast/step_* directories while the orchestrator/inference servers are still loading them.

The orchestrator now writes an atomic LOADED_STEP marker after successfully calling update_weights, and trainer cleanup (maybe_clean) is changed from deleting a single candidate step to sweeping and removing only steps older than LOADED_STEP (respecting interval_to_keep).

Written by Cursor Bugbot for commit 7bb01ce. This will update automatically on new commits. Configure here.

…ator

The trainer's maybe_clean deletes broadcast checkpoint directories on a
timer, but the orchestrator may not have loaded them yet (its event loop
can be blocked for 70+ seconds during rollout generation). This caused
FileNotFoundError on the inference server and crashed the run.

Fix:
- Orchestrator writes a LOADED_STEP marker after each successful weight
  update so the trainer knows which checkpoints have been consumed.
- Trainer only cleans broadcast directories the orchestrator has moved
  past (candidate_step < loaded_step).
- Orchestrator catches update_weights failures and retries on the next
  poll as defense-in-depth.
- Fix DefaultModelLoader import for vllm >=0.16 (module renamed from
  loader to default_loader).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Copy link
Member

@mikasenghaas mikasenghaas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it should be fine? im a little lost with our ckpt logic (ie which ckpt goes where) since multi tenant but overall looks sensible

)

# Update weights on inference servers
# Update weights on inference servers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭

return

# Sweep all eligible historical steps so skipped candidates are eventually cleaned.
for step_dir in path.glob("step_*"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will break if we ever rename get_step_path but ig we wont haha so ig we're good?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants