fix: Fix device mismatch when DPO runs validation at start with CPU offload (Nemotron MoE) #1930

Draft
RayenTian wants to merge 1 commit into main from ruit/fix_dpo_cpu_offload

RayenTian commented on Feb 12, 2026

What does this PR do?

Summary
When DPO is run with val_at_start=true and policy.dtensor_cfg.cpu_offload=true, validation can crash on MoE models (e.g. NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The failure occurs in the model's MoE gate (e.g. in get_topk_indices, where scores is on CUDA but e_score_correction_bias is still on CPU).
Dense models (e.g. Qwen2.5-7B) do not hit this error under the same settings.
This PR fixes the bug by calling policy.prepare_for_lp_inference() before running the initial validation when val_at_start is enabled, so that all buffers (including MoE gate buffers) are on CUDA before any reference-policy logprob computation.
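
To illustrate the ordering issue described above, here is a minimal toy sketch in pure Python. It does not use PyTorch or the real training code: device placement is modeled with plain strings, and `ToyPolicy`, `DeviceMismatchError`, `get_reference_logprobs`, and `run_initial_validation` are hypothetical names invented for this illustration. Only `prepare_for_lp_inference`, `val_at_start`, and `e_score_correction_bias` come from the description, and the assumption that `prepare_for_lp_inference()` moves every buffer to CUDA is taken from the summary rather than from the implementation.

```python
from dataclasses import dataclass, field


class DeviceMismatchError(RuntimeError):
    """Stand-in for the RuntimeError PyTorch raises on mixed-device ops."""


@dataclass
class ToyPolicy:
    # Maps a tensor name to the device it lives on. With CPU offload
    # enabled, the toy starts with the MoE gate buffer on the CPU.
    devices: dict[str, str] = field(
        default_factory=lambda: {"gate.e_score_correction_bias": "cpu"}
    )

    def prepare_for_lp_inference(self) -> None:
        # Assumed behavior (per the PR summary): move all buffers to CUDA.
        for name in self.devices:
            self.devices[name] = "cuda:0"

    def get_reference_logprobs(self) -> str:
        # Hypothetical stand-in for the reference-policy logprob pass.
        scores_device = "cuda:0"  # gate scores are computed on the GPU
        bias_device = self.devices["gate.e_score_correction_bias"]
        if scores_device != bias_device:
            raise DeviceMismatchError(
                "Expected all tensors to be on the same device, but found "
                f"at least two devices, {scores_device} and {bias_device}!"
            )
        return "logprobs"


def run_initial_validation(policy: ToyPolicy, val_at_start: bool,
                           prepare_first: bool):
    """Run validation before training if val_at_start is set.

    prepare_first=False mimics the behavior before the fix;
    prepare_first=True mimics the behavior after it.
    """
    if not val_at_start:
        return None
    if prepare_first:
        policy.prepare_for_lp_inference()
    return policy.get_reference_logprobs()


if __name__ == "__main__":
    try:
        run_initial_validation(ToyPolicy(), val_at_start=True, prepare_first=False)
    except DeviceMismatchError as err:
        print("before fix:", err)
    print("after fix:", run_initial_validation(ToyPolicy(), True, True))
```

In this sketch, a dense model would correspond to a policy with no offloaded gate buffer, so it would never reach the mismatch branch; that matches the observation that dense models do not fail under the same settings.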

Issues

Related to #1922

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

RayenTian added the "CI:L1 Run doctests, unit tests, and functional tests" label on Feb 12, 2026
RayenTian changed the title from "fix: fix dpo device mismatch bug when enable cpu_offload and val_at_start" to "fix: Fix device mismatch when DPO runs validation at start with CPU offload (Nemotron MoE)" on Feb 12, 2026
RayenTian force-pushed the ruit/fix_dpo_cpu_offload branch from 80fa04c to e77f145 on February 12, 2026 07:41
RayenTian force-pushed the ruit/fix_dpo_cpu_offload branch from e77f145 to 662aebc on February 12, 2026 08:04