-
Notifications
You must be signed in to change notification settings - Fork 3.1k
[ckpt] feat: add CheckpointEngineManager #5031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ckpt] feat: add CheckpointEngineManager #5031
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces the CheckpointEngineManager to streamline weight synchronization between trainer and rollout replicas, which is a significant architectural improvement. It refactors existing checkpoint engine tests to use this new manager and introduces new test cases. The changes also include updates to configuration structures and utility functions to support asynchronous operations and ensure correctness during weight transfers.
9452913 to
79fa28e
Compare
| self, | ||
| backend: str, | ||
| trainer: RayWorkerGroup, | ||
| replicas: list[RolloutReplica], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggesting avoid initializing from workers and let AgentLoop register itself instead, hiding replicas from CheckpointEngineManager. Otherwise if we use P2P backend to support elastic rollout, we need to go through all the labors to pass replicas back to CheckpointEngineManager even if we don't need to rebuild process group.
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a CheckpointEngineManager to centralize and streamline weight synchronization between trainer and rollout replicas. This is a significant architectural improvement that enhances the management of distributed model updates. The changes involve refactoring configuration structures, abstracting checkpoint engine operations, and updating various worker implementations to integrate with the new manager. This new abstraction is well-designed and improves the clarity and maintainability of the weight synchronization logic. However, there are a few areas that need attention to ensure full functional correctness and test coverage.
| self._init_agent_loop_workers() | ||
|
|
||
| # Initially we're in sleep mode. | ||
| if self.config.actor_rollout_ref.rollout.free_cache_engine: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpoint engine 支持后,不需要再用sleep/wakeup 去切换了,是吗?
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any relevant documentation for the checkpoint engine? Once this feature is supported, will it no longer impede the subsequent Elastic rollout?
What does this PR do?
#4280 refactor vllm breaking
one-step-off-policyandfully-async. This PR introduce CheckpointEngineManager to coordinate weight synchronization between trainer and rollout replicas.Next PR, refactor
one-step-off-policyandfully-asyncwith CheckpointEngineManager.design doc: https://github.com/volcengine/verl/tree/main/verl/checkpoint_engine