[ckpt] feat: add CheckpointEngineManager #5031

wuxibin89 · 2026-01-23T17:14:44Z

What does this PR do?

#4280 refactored the vLLM rollout, which broke one-step-off-policy and fully-async. This PR introduces CheckpointEngineManager to coordinate weight synchronization between the trainer and the rollout replicas.

A follow-up PR will refactor one-step-off-policy and fully-async to use CheckpointEngineManager.

design doc: https://github.com/volcengine/verl/tree/main/verl/checkpoint_engine
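For readers who have not opened the design doc, here is a rough sketch of the coordination idea described above: one manager object takes the trainer's current weights and pushes them to every rollout replica. This is not verl's implementation. The constructor parameters (`backend`, `trainer`, `replicas`) mirror the diff excerpt quoted in the review below, but `ToyTrainer`, `ToyReplica`, `ToyCheckpointManager`, and every method name are hypothetical stand-ins, and plain Python dicts replace real tensors and transport backends.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTrainer:
    """Hypothetical stand-in for the trainer side (not verl's RayWorkerGroup)."""

    weights: dict[str, list[float]]
    version: int = 0

    def export_weights(self) -> tuple[int, dict[str, list[float]]]:
        # Copy the values so later training steps cannot mutate what was exported.
        return self.version, {k: list(v) for k, v in self.weights.items()}


@dataclass
class ToyReplica:
    """Hypothetical stand-in for a rollout replica (not verl's RolloutReplica)."""

    name: str
    weights: dict[str, list[float]] = field(default_factory=dict)
    version: int = -1  # -1 means "no weights received yet"

    def load_weights(self, version: int, weights: dict[str, list[float]]) -> None:
        self.weights = weights
        self.version = version


class ToyCheckpointManager:
    """Assumed shape only: one object that fans trainer weights out to replicas."""

    def __init__(self, backend: str, trainer: ToyTrainer, replicas: list[ToyReplica]):
        self.backend = backend  # kept for illustration; no real transport is used here
        self.trainer = trainer
        self.replicas = replicas

    def update_weights(self) -> int:
        version, weights = self.trainer.export_weights()
        for replica in self.replicas:
            # Each replica gets its own copy, mimicking a transfer across processes.
            replica.load_weights(version, {k: list(v) for k, v in weights.items()})
        return version
```

In this toy version the trainer and replicas never talk to each other directly; callers only invoke `update_weights()` on the manager after a training step.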

gemini-code-assist

Code Review

This pull request introduces the CheckpointEngineManager to streamline weight synchronization between trainer and rollout replicas, which is a significant architectural improvement. It refactors existing checkpoint engine tests to use this new manager and introduces new test cases. The changes also include updates to configuration structures and utility functions to support asynchronous operations and ensure correctness during weight transfers.

Shangwei-Li · 2026-01-27T02:34:35Z

verl/checkpoint_engine/base.py

+        self,
+        backend: str,
+        trainer: RayWorkerGroup,
+        replicas: list[RolloutReplica],


I suggest not initializing from the workers and instead letting AgentLoop register itself, so that the replicas are hidden from CheckpointEngineManager. Otherwise, if we use a P2P backend to support elastic rollout, we would have to go to the trouble of passing the replicas back to CheckpointEngineManager even when the process group does not need to be rebuilt.
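One possible reading of this suggestion, sketched with made-up names: the manager is constructed without any replicas, and the agent loop registers (and can later unregister) a callback per replica. `ToyRegistryManager`, `ToyAgentLoop`, `register`, `unregister`, and `broadcast` are all hypothetical and do not exist in verl; the sketch only illustrates how an elastic set of replicas could change without the manager being rebuilt.

```python
from typing import Callable

Weights = dict[str, float]


class ToyRegistryManager:
    """Hypothetical manager that never receives a replica list up front."""

    def __init__(self, backend: str):
        self.backend = backend  # illustrative only; no real backend is used
        self._receivers: dict[str, Callable[[Weights], None]] = {}

    def register(self, replica_id: str, load_fn: Callable[[Weights], None]) -> None:
        self._receivers[replica_id] = load_fn

    def unregister(self, replica_id: str) -> None:
        self._receivers.pop(replica_id, None)

    def broadcast(self, weights: Weights) -> list[str]:
        delivered = []
        # Iterate over a snapshot so a receiver may unregister during delivery.
        for replica_id, load_fn in list(self._receivers.items()):
            load_fn(weights)
            delivered.append(replica_id)
        return delivered


class ToyAgentLoop:
    """Hypothetical agent loop that owns its replicas and registers them itself."""

    def __init__(self, manager: ToyRegistryManager, replica_ids: list[str]):
        self.received: dict[str, Weights] = {}
        for rid in replica_ids:
            # Bind rid as a default argument so each callback keeps its own id.
            manager.register(rid, lambda w, rid=rid: self._load(rid, w))

    def _load(self, replica_id: str, weights: Weights) -> None:
        self.received[replica_id] = dict(weights)
```

With this shape, scaling the rollout up or down is a `register` or `unregister` call from the side that owns the replicas, rather than a new replica list handed back to the manager.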

wuxibin89 · 2026-01-27T03:59:47Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a CheckpointEngineManager to centralize and streamline weight synchronization between trainer and rollout replicas. This is a significant architectural improvement that enhances the management of distributed model updates. The changes involve refactoring configuration structures, abstracting checkpoint engine operations, and updating various worker implementations to integrate with the new manager. This new abstraction is well-designed and improves the clarity and maintainability of the weight synchronization logic. However, there are a few areas that need attention to ensure full functional correctness and test coverage.

tests/checkpoint_engine/test_correctness_on_gpu.py

tests/checkpoint_engine/test_correctness_on_npu.py

verl/checkpoint_engine/base.py

verl/workers/rollout/sglang_rollout/sglang_rollout.py

chenjiaoAngel · 2026-01-27T07:35:40Z

verl/experimental/agent_loop/agent_loop.py

        self._init_agent_loop_workers()

-        # Initially we're in sleep mode.
-        if self.config.actor_rollout_ref.rollout.free_cache_engine:


Now that the checkpoint engine is supported, we no longer need to use sleep/wakeup to switch, is that right?

chenjiaoAngel · 2026-01-27T07:38:19Z

verl/checkpoint_engine/base.py

 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-


Is there any documentation for the checkpoint engine? Once this feature is supported, will it stop blocking the subsequent elastic rollout work?

wuxibin89 requested review from PeterSH6, chenhaiq, eric-haibin-lin, tongyx361, vermouth1992, zhaochenyang20 and zw0610 as code owners January 23, 2026 17:14

wuxibin89 changed the title ~~feat: add CheckpointEngineManager~~ [ckpt] feat: add CheckpointEngineManager Jan 23, 2026

gemini-code-assist bot reviewed Jan 23, 2026

wuxibin89 requested a review from ArronHZG January 23, 2026 17:23

wuxibin89 mentioned this pull request Jan 26, 2026

[roadmap] verl Q1 roadmap #4880 (Open, 30 tasks)

wuxibin89 requested review from FightingZhen, ISEEKYAN, ji-huazhong and tardis-key as code owners January 26, 2026 14:51

wuxibin89 added 8 commits January 26, 2026 22:52

feat: add CheckpointEngineManager

f14e266

fix sanity

1e56364

integrate CheckpointEngineManager into RayPPOTrainer

fda97c5

fix trainer

c41e8fa

fix trainer

df63e0b

fix legacy workers

f9087ce

fix TQ trainer

17f1979

fix legacy megatron worker

79fa28e

wuxibin89 force-pushed the wuxibin/checkpoint_engine_manager branch from 9452913 to 79fa28e Compare January 26, 2026 15:20

wuxibin89 added 2 commits January 26, 2026 23:38

fix cupy import error

7e181cf

fix cupy import error

4fc7060

Shangwei-Li reviewed Jan 27, 2026

fix ci

079d38c

gemini-code-assist bot reviewed Jan 27, 2026

vermouth1992 approved these changes Jan 27, 2026

vermouth1992 merged commit ef6eaa0 into verl-project:main Jan 27, 2026
81 of 101 checks passed

chenjiaoAngel reviewed Jan 27, 2026


4 participants
