Fix weight update race condition between trainer cleanup and orchestrator #1857
Open
Fix weight update race condition between trainer cleanup and orchestrator #1857
Conversation
…ator The trainer's maybe_clean deletes broadcast checkpoint directories on a timer, but the orchestrator may not have loaded them yet (its event loop can be blocked for 70+ seconds during rollout generation). This caused FileNotFoundError on the inference server and crashed the run. Fix: - Orchestrator writes a LOADED_STEP marker after each successful weight update so the trainer knows which checkpoints have been consumed. - Trainer only cleans broadcast directories the orchestrator has moved past (candidate_step < loaded_step). - Orchestrator catches update_weights failures and retries on the next poll as defense-in-depth. - Fix DefaultModelLoader import for vllm >=0.16 (module renamed from loader to default_loader). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
samsja
reviewed
Feb 23, 2026
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Member
mikasenghaas
left a comment
There was a problem hiding this comment.
i think it should be fine? im a little lost with our ckpt logic (ie which ckpt goes where) since multi tenant but overall looks sensible
| ) | ||
|
|
||
| # Update weights on inference servers | ||
| # Update weights on inference servers. |
| return | ||
|
|
||
| # Sweep all eligible historical steps so skipped candidates are eventually cleaned. | ||
| for step_dir in path.glob("step_*"): |
Member
There was a problem hiding this comment.
this will break if we ever rename get_step_path but ig we wont haha so ig we're good?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
current bug in prime-rl when having excessive event loop lag, it will cause the cleanup of old broadcast/step_n that wont happen immediately, then inference will check step_n and see its there, then the delete takes place and the following load fails because step_n is missing by the time it tries to load
The trainer's maybe_clean deletes broadcast checkpoint directories on a timer, but the orchestrator may not have loaded them yet (its event loop can be blocked for 70+ seconds during rollout generation). This caused FileNotFoundError on the inference server and crashed the run.
Fix:
Note
Medium Risk
Changes cross-process coordination and filesystem cleanup logic; mistakes could cause disk bloat (under-cleaning) or renewed missing-checkpoint failures (over-cleaning) under laggy conditions.
Overview
Prevents a race where the trainer deletes
broadcast/step_*directories while the orchestrator/inference servers are still loading them.The orchestrator now writes an atomic
LOADED_STEPmarker after successfully callingupdate_weights, and trainer cleanup (maybe_clean) is changed from deleting a single candidate step to sweeping and removing only steps older thanLOADED_STEP(respectinginterval_to_keep).Written by Cursor Bugbot for commit 7bb01ce. This will update automatically on new commits. Configure here.