@wuxibin89
Collaborator

@wuxibin89 wuxibin89 commented Jan 23, 2026

What does this PR do?

The vLLM refactor in #4280 broke one-step-off-policy and fully-async. This PR introduces CheckpointEngineManager to coordinate weight synchronization between the trainer and the rollout replicas.

The next PR will refactor one-step-off-policy and fully-async to use CheckpointEngineManager.

design doc: https://github.com/volcengine/verl/tree/main/verl/checkpoint_engine

@wuxibin89 wuxibin89 changed the title from "feat: add CheckpointEngineManager" to "[ckpt] feat: add CheckpointEngineManager" Jan 23, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the CheckpointEngineManager to streamline weight synchronization between trainer and rollout replicas, which is a significant architectural improvement. It refactors existing checkpoint engine tests to use this new manager and introduces new test cases. The changes also include updates to configuration structures and utility functions to support asynchronous operations and ensure correctness during weight transfers.

@wuxibin89 wuxibin89 force-pushed the wuxibin/checkpoint_engine_manager branch from 9452913 to 79fa28e Compare January 26, 2026 15:20
self,
backend: str,
trainer: RayWorkerGroup,
replicas: list[RolloutReplica],
Contributor

@Shangwei-Li Shangwei-Li Jan 27, 2026

I suggest not initializing from workers and letting AgentLoop register itself instead, so that replicas are hidden from CheckpointEngineManager. Otherwise, if we use a P2P backend to support elastic rollout, we have to go to the trouble of passing replicas back to CheckpointEngineManager even when we don't need to rebuild the process group.
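
To make the suggestion concrete, here is a minimal sketch of the self-registration idea. Every name in it (`Replica`, `Manager`, `register`, `unregister`, `update_weights`) is hypothetical and chosen only for illustration; none of it is verl's actual API, and the real `RolloutReplica` and `CheckpointEngineManager` may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for a rollout replica (not verl's RolloutReplica)."""
    name: str
    weights_version: int = 0

    def load_weights(self, version: int) -> None:
        # Record which weight version this replica currently holds.
        self.weights_version = version


@dataclass
class Manager:
    """Hypothetical manager that is not handed a replica list at construction."""
    backend: str
    _replicas: dict[str, Replica] = field(default_factory=dict)

    def register(self, replica: Replica) -> None:
        # Called from the rollout side (for example an agent loop) when a replica joins.
        self._replicas[replica.name] = replica

    def unregister(self, name: str) -> None:
        # Called when a replica leaves; unknown names are ignored.
        self._replicas.pop(name, None)

    def update_weights(self, version: int) -> list[str]:
        # Push the new version to whichever replicas are registered right now.
        for replica in self._replicas.values():
            replica.load_weights(version)
        return sorted(self._replicas)


manager = Manager(backend="p2p")
a, b = Replica("replica-0"), Replica("replica-1")
manager.register(a)
manager.register(b)
manager.update_weights(version=1)

manager.unregister("replica-1")  # elastic scale-down
synced = manager.update_weights(version=2)
```

With this shape, scaling the rollout up or down only touches `register` and `unregister`; the caller that triggers a weight update never has to pass the replica list back in.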

@wuxibin89
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a CheckpointEngineManager to centralize and streamline weight synchronization between trainer and rollout replicas. This is a significant architectural improvement that enhances the management of distributed model updates. The changes involve refactoring configuration structures, abstracting checkpoint engine operations, and updating various worker implementations to integrate with the new manager. This new abstraction is well-designed and improves the clarity and maintainability of the weight synchronization logic. However, there are a few areas that need attention to ensure full functional correctness and test coverage.

@vermouth1992 vermouth1992 merged commit ef6eaa0 into verl-project:main Jan 27, 2026
81 of 101 checks passed
self._init_agent_loop_workers()

# Initially we're in sleep mode.
if self.config.actor_rollout_ref.rollout.free_cache_engine:
Contributor

Once the checkpoint engine is supported, we no longer need to use sleep/wakeup to switch, right?

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Contributor

Is there any documentation for the checkpoint engine? Once this feature is supported, will it stop blocking the follow-up elastic rollout work?

