Merged
2 changes: 1 addition & 1 deletion .github/workflows/gpu_unit_tests.yml
@@ -113,7 +113,7 @@ jobs:
          pip3 install --ignore-installed mlflow "numpy<2.0"
      - name: Run all GPU unit tests
        run: |
          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob='*on_cpu.py' --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob='tests/special*' --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" tests/
          pytest -s -x --ignore-glob="*on_npu.py" --ignore-glob="*test_special_*.py" --ignore-glob='*on_cpu.py' --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob='tests/special*' --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" tests/
      - name: Testing LinearCrossEntropyTP Correctness, Computation Time and Memory Consumption
        run: |
          LOW_MEMORY=True torchrun --standalone --nnodes=1 --nproc-per-node=8 tests/utils/test_special_linear_cross_entropy_tp.py
30 changes: 29 additions & 1 deletion .github/workflows/sgl.yml
@@ -113,6 +113,7 @@ jobs:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install cupy-cuda12x pytest-asyncio
          pip3 install hf_transfer fastmcp pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
@@ -124,9 +125,36 @@
        run: |
          ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop
  sgl_checkpoint_engine:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: 1
      SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK: "True"
      NCCL_SHM_DISABLE: "1"
      NCCL_P2P_DISABLE: "1"
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install cupy-cuda12x pytest-asyncio
          pip3 install hf_transfer fastmcp pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Test SGLang ServerAdapter with Checkpoint Engine (NCCL)
        run: |
          ROLLOUT_NAME=sglang pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py
  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, sgl]
    needs: [setup, sgl, sgl_checkpoint_engine]
    if: always()
    steps:
      - id: destroy-runner
26 changes: 24 additions & 2 deletions .github/workflows/vllm.yml
@@ -126,11 +126,33 @@
      - name: Test vllm server abort functionality
        run: |
          pytest tests/workers/rollout/rollout_vllm/test_vllm_abort.py -v -s
      # Note(haibin.lin): for any new test, please update gpu_unit_tests.yaml to avoid repeated tests

  vllm_checkpoint_engine:
    needs: setup
    runs-on: ["${{ needs.setup.outputs.runner-label || 'L20x8' }}"]
    timeout-minutes: 35 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1,hf-mirror.com"
      HF_ENDPOINT: "https://hf-mirror.com"
      HF_HUB_ENABLE_HF_TRANSFER: "0" # This is more stable
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
      - name: Install the current repository
        run: |
          pip3 install cupy-cuda12x pytest-asyncio
          pip3 install -r requirements-test.txt
          pip3 install --no-deps -e .
      - name: Test vLLM ServerAdapter with Checkpoint Engine (NCCL)
        run: |
          ROLLOUT_NAME=vllm pytest -svvv tests/checkpoint_engine/test_special_server_adapter.py

  cleanup:
    runs-on: ubuntu-latest
    needs: [setup, vllm]
    needs: [setup, vllm, vllm_checkpoint_engine]
    if: always()
    steps:
      - id: destroy-runner
@@ -72,7 +72,7 @@ python3 -m verl.trainer.main_ppo --config-path=config \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.max_num_seqs=${MAX_BATCH_SIZE} \
actor_rollout_ref.rollout.max_num_batched_tokens=32768 \
actor_rollout_ref.rollout.update_weights_bucket_megabytes=4096 \
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${ACTOR_TP} \
+actor_rollout_ref.rollout.engine_kwargs.trtllm.batch_wait_timeout_iters=32 \
2 changes: 1 addition & 1 deletion examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh
@@ -75,7 +75,7 @@ python3 -m verl.trainer.main_ppo \
actor_rollout_ref.rollout.calculate_log_probs=True \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.rollout.update_weights_bucket_megabytes=4096 \
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console","wandb"]' \
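Both example scripts above pick up the same rename: the update-weights bucket size now lives under the nested checkpoint_engine group instead of directly on the rollout config. A minimal sketch of the corresponding config objects, using the classes the new tests import from verl.workers.config; treating update_weights_bucket_megabytes as a CheckpointEngineConfig field is an inference from the new CLI path, not something this diff shows directly:

```python
from verl.workers.config import CheckpointEngineConfig, RolloutConfig

# The override actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096
# presumably maps onto something like:
checkpoint_engine = CheckpointEngineConfig(update_weights_bucket_megabytes=4096)  # field name inferred
rollout = RolloutConfig(name="vllm", checkpoint_engine=checkpoint_engine)
```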
139 changes: 139 additions & 0 deletions tests/checkpoint_engine/test_correctness_on_gpu.py
@@ -0,0 +1,139 @@
# Copyright 2024 Bytedance Ltd. and/or its affiliates
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import pytest
import ray

from tests.checkpoint_engine.test_utils import create_rollout_worker_group, create_trainer_worker_group
from verl.checkpoint_engine import CheckpointEngineManager
from verl.single_controller.ray.base import (
    RayResourcePool,
    split_resource_pool,
)
from verl.workers.config import CheckpointEngineConfig, HFModelConfig, RolloutConfig


@pytest.mark.asyncio
@pytest.mark.parametrize("rebuild_group", [False, True])
@pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
async def test_nccl_checkpoint_engine(
    rebuild_group,
    num_trainer,
    num_rollout,
    num_nodes=1,
    num_gpus_per_node=8,
    check_allclose=True,
    model_path="~/models/Qwen/Qwen3-8B-Base",
):
    model_path = os.path.expanduser(model_path)
    ray.init(
        runtime_env={
            "env_vars": {
                "UCX_TLS": "rc,tcp,cuda",
                "UCX_MAX_RNDV_RAILS": "4",
                "UCX_LOG_LEVEL": "INFO",
                "VERL_LOGGING_LEVEL": "DEBUG",
            }
        }
    )

    # initialize config
    checkpoint_engine_config = CheckpointEngineConfig(
        backend="nccl", engine_kwargs={"nccl": {"rebuild_group": rebuild_group}}
    )
    model_config = HFModelConfig(path=model_path, use_remove_padding=True)
    rollout_config = RolloutConfig(name="vllm", checkpoint_engine=checkpoint_engine_config)

    # create trainer and rollout worker group
    resource_pool = RayResourcePool(process_on_nodes=[num_gpus_per_node] * num_nodes, max_colocate_count=3)
    trainer_pool, rollout_pool = split_resource_pool(resource_pool, [num_trainer, num_rollout])
    trainer = create_trainer_worker_group(trainer_pool, model_config, checkpoint_engine_config)
    trainer.reset()
    rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)

    # create checkpoint engine manager
    checkpoint_manager = CheckpointEngineManager(backend="nccl", trainer=trainer, replicas=replicas)
    for _ in range(3):
        await checkpoint_manager.update_weights()
        rollout.check_weights()

    ray.shutdown()


@pytest.mark.skip(reason="temporary skip since our ci environment is not ready")
@pytest.mark.asyncio
@pytest.mark.parametrize("device", ["cuda", "cpu"])
@pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
async def test_nixl_checkpoint_engine(
    num_trainer,
    num_rollout,
    device,
    num_nodes=1,
    num_gpus_per_node=8,
    check_allclose=True,
    model_path="~/models/Qwen/Qwen3-8B-Base",
):
    model_path = os.path.expanduser(model_path)
    ray.init(
        runtime_env={
            "env_vars": {
                # TODO: it's pretty hard to set these environment variables right, please consult
                # with your network admin. Maybe auto adjust UCX_* according to NCCL_IB_*?
                "UCX_TLS": "rc,ud,cuda",
                # "UCX_IB_GID_INDEX": "3", # NCCL_IB_GID_INDEX
                # "UCX_IB_DEVICES": "mlx5_1:1,mlx5_2:1,mlx5_3:1", # NCCL_IB_HCA
                "UCX_RC_TIMEOUT": "30s", # NCCL_IB_TIMEOUT
                "UCX_RC_RETRY_COUNT": "7", # NCCL_IB_RETRY_COUNT
                "UCX_KEEPALIVE_INTERVAL": "1s",
                "UCX_KEEPALIVE_NUM_EPS": "10",
                "UCX_MAX_RNDV_RAILS": "4",
                "UCX_IB_ROCE_REACHABILITY_MODE": "all",
                "UCX_LOG_LEVEL": "INFO",
                "VERL_LOGGING_LEVEL": "DEBUG",
            }
        }
    )

    # initialize config
    checkpoint_engine_config = CheckpointEngineConfig(backend="nixl", engine_kwargs={"nixl": {"device": device}})
    model_config = HFModelConfig(path=model_path, use_remove_padding=True)
    rollout_config = RolloutConfig(name="vllm", checkpoint_engine=checkpoint_engine_config)

    # create trainer and rollout worker group
    resource_pool = RayResourcePool(process_on_nodes=[num_gpus_per_node] * num_nodes, max_colocate_count=3)
    trainer_pool, rollout_pool = split_resource_pool(resource_pool, [num_trainer, num_rollout])
    trainer = create_trainer_worker_group(trainer_pool, model_config, checkpoint_engine_config)
    trainer.reset()
    rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)

    # create checkpoint engine manager
    checkpoint_manager = CheckpointEngineManager(backend="nixl", trainer=trainer, replicas=replicas)
    for _ in range(3):
        await checkpoint_manager.update_weights()
        rollout.check_weights()

    ray.shutdown()


if __name__ == "__main__":
    test_nccl_checkpoint_engine(
        rebuild_group=False,
        num_trainer=2,
        num_rollout=30,
        num_nodes=4,
        num_gpus_per_node=8,
        check_allclose=False,
        model_path=os.environ["HDFS_ROOT"] + "/model/Qwen3-30B-A3B-Base",
    )
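One note on the __main__ block above: test_nccl_checkpoint_engine is an async def, so calling it directly only creates a coroutine. To run the file standalone (outside pytest-asyncio), one would presumably drive it with an event loop along these lines (illustrative sketch, not part of the PR):

```python
import asyncio

if __name__ == "__main__":
    # Wrap the async test in asyncio.run() so the coroutine actually executes.
    asyncio.run(
        test_nccl_checkpoint_engine(
            rebuild_group=False,
            num_trainer=2,
            num_rollout=30,
            num_nodes=4,
            num_gpus_per_node=8,
            check_allclose=False,
            model_path=os.environ["HDFS_ROOT"] + "/model/Qwen3-30B-A3B-Base",
        )
    )
```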
@@ -17,17 +17,19 @@
import ray

from tests.checkpoint_engine.test_utils import create_rollout_worker_group, create_trainer_worker_group
from verl.checkpoint_engine import CheckpointEngineManager
from verl.single_controller.ray.base import (
    RayResourcePool,
    split_resource_pool,
)
from verl.utils.device import get_device_name
from verl.workers.config import CheckpointEngineConfig, HFModelConfig, RolloutConfig


@pytest.mark.skipif(get_device_name() != "npu", reason="NPU is not available")
@pytest.mark.parametrize("rebuild_group", [False, True])
@pytest.mark.asyncio
@pytest.mark.parametrize("rebuild_group", [False])
@pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
def test_hccl_checkpoint_engine(
async def test_hccl_checkpoint_engine(
    rebuild_group,
    num_trainer,
    num_rollout,
@@ -48,55 +50,25 @@ def test_hccl_checkpoint_engine(
        }
    )

    # initialize config
    checkpoint_engine_config = CheckpointEngineConfig(
        backend="hccl", engine_kwargs={"hccl": {"rebuild_group": rebuild_group}}
    )
    model_config = HFModelConfig(path=model_path, use_remove_padding=True)
    rollout_config = RolloutConfig(name="vllm", checkpoint_engine=checkpoint_engine_config)

    # create trainer and rollout worker group
    resource_pool = RayResourcePool(process_on_nodes=[num_gpus_per_node] * num_nodes, max_colocate_count=3)
    resource_pool.get_placement_groups(device_name=get_device_name())
    trainer_pool, rollout_pool = split_resource_pool(resource_pool, [num_trainer, num_rollout])
    checkpoint_kwargs = {
        "bucket_size": 2 * 1024 * 1024 * 1024, # 2GB
        "rebuild_group": rebuild_group,
    }

    trainer = create_trainer_worker_group(model_path, trainer_pool, "hccl", checkpoint_kwargs)
    trainer = create_trainer_worker_group(trainer_pool, model_config, checkpoint_engine_config)
    trainer.reset()
    rollout = create_rollout_worker_group(
        model_path, rollout_pool, "hccl", checkpoint_kwargs, check_allclose=check_allclose
    )
    rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)

    # create checkpoint engine manager
    checkpoint_manager = CheckpointEngineManager(backend="hccl", trainer=trainer, replicas=replicas)
    for _ in range(3):
        # 1. prepare all workers
        metadata = ray.get(
            trainer.execute_checkpoint_engine(["prepare"] * trainer.world_size)
            + rollout.execute_checkpoint_engine(["prepare"] * rollout.world_size)
        )
        trainer_kwargs = {
            "method": ["init_process_group"] * trainer.world_size,
            "rank": [0] + [-1] * (trainer.world_size - 1),
            "world_size": [rollout.world_size + 1] * trainer.world_size,
            "master_metadata": [metadata[0]] * trainer.world_size,
        }
        rollout_kwargs = {
            "method": ["init_process_group"] * rollout.world_size,
            "rank": list(range(1, rollout.world_size + 1)),
            "world_size": [rollout.world_size + 1] * rollout.world_size,
            "master_metadata": [metadata[0]] * rollout.world_size,
        }

        # 2. init process group between all workers
        ray.get(
            trainer.execute_checkpoint_engine(**trainer_kwargs) + rollout.execute_checkpoint_engine(**rollout_kwargs)
        )

        # 3. update weights of all workers
        print("start to upate")
        ray.get(trainer.update_weights() + rollout.update_weights())

        # 4. finish all workers
        ray.get(
            trainer.execute_checkpoint_engine(["finish"] * trainer.world_size)
            + rollout.execute_checkpoint_engine(["finish"] * rollout.world_size)
        )
        print("end update")
        # 5. check weights of rollout workers
        await checkpoint_manager.update_weights()
        rollout.check_weights()

    ray.shutdown()
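The simplification in this file: the weight-sync choreography the old test scripted by hand (prepare, init_process_group with trainer rank 0 as master and rollout workers as ranks 1..N, update_weights on both worker groups, finish) is now driven by CheckpointEngineManager. A condensed, illustrative mapping, using only names visible in this diff; the manager's internals are not part of the PR:

```python
async def sync_weights_once(checkpoint_manager, rollout):
    # Old flow (manual, per sync round): prepare all workers ->
    # init_process_group(trainer rank 0 = master, rollout workers = ranks 1..N) ->
    # update_weights() on both worker groups -> finish.
    # New flow: the manager presumably drives that sequence internally.
    await checkpoint_manager.update_weights()
    # Rollout workers verify they received the trainer's weights.
    rollout.check_weights()
```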