@psyloy psyloy commented Jan 19, 2026

What does this PR do?

This PR mainly adds the following for the Ascend NPU platform:

  • Add GRPO training scripts for the Qwen2.5-32B model based on the Megatron and vLLM backends.
  • Add GRPO training scripts for the Qwen3-30B model based on the Megatron and vLLM backends.

Test

Qwen2.5-32B GRPO training on GSM8K:
[Figure: metrics trends, NPU, Qwen2.5-32B, 100 steps]

Qwen3-30B GRPO training on GSM8K:
[Figure: metrics trends, NPU, Qwen3-30B, 100 steps]

API and Usage Example

# Run Qwen2.5-32B GRPO training (Megatron + vLLM backend) on Ascend NPU
bash ./examples/grpo_trainer/run_qwen2_5-32b_grpo_megatron_vllm_npu.sh
# Run Qwen3MoE-30B GRPO training (Megatron + vLLM backend) on Ascend NPU
bash ./examples/grpo_trainer/run_qwen3moe-30b_grpo_megatron_vllm_npu.sh

CLAassistant commented Jan 19, 2026

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new GRPO training scripts for Qwen models on Ascend NPUs, along with corresponding documentation. My review focuses on correctness issues within the newly added shell scripts and the documentation file. I have identified several critical syntax errors in the shell scripts that would cause them to fail, likely due to copy-paste mistakes (e.g., + prefixes and the use of an undefined variable). The documentation's code examples contain similar errors. I have provided specific suggestions to rectify these issues.

Comment on lines +98 to +102
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True

critical

These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.

Suggested change
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
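
For context on the `+` prefix discussed above, here is a minimal sketch of how bash itself stores such an entry in an array. The array name `DEMO_CONFIG` and the keys are hypothetical. Bash keeps a `+`-prefixed word as an ordinary element, so whether the prefix is a problem depends on the program that receives the arguments. Hydra-style command lines, for instance, read a leading `+` as "add this key"; that the trainer here uses such a parser is an assumption, not something this thread confirms.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override list; the key names are illustrative only.
DEMO_CONFIG=(
    demo.section.existing_key=1
    +demo.section.new_key=True
)

# Print each argument on its own line to show what the callee would receive.
show_args() {
    printf '%s\n' "$@"
}

show_args "${DEMO_CONFIG[@]}"
```

Running the sketch prints both entries unchanged, including the leading `+`, which makes it easy to check what a given script actually passes to the trainer before deciding whether to strip the prefix.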

actor_rollout_ref.actor.megatron.grad_offload=${all_offload}
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH}
actor_rollout_ref.actor.megatron.use_dist_checkpointing=False
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True

critical

This line has a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefix.

Suggested change
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True

"${ACTOR_CONFIG[@]}" \
"${REF_CONFIG[@]}" \
"${ROLLOUT_CONFIG[@]}" \
"${REWARD_MODEL_CONFIG[@]}" \

critical

The REWARD_MODEL_CONFIG variable is used here but is not defined anywhere in the script. Since set -u is active, this will cause the script to fail with an 'unbound variable' error. This parameter does not seem to be required according to the provided documentation. Please remove this line.
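
If removing the line is not desirable, one defensive alternative is to expand an optional array only when it has elements. The sketch below uses hypothetical names (`ROLLOUT_DEMO`, `OPTIONAL_DEMO`, `count_args`) and relies on the `${name[@]+...}` form, which expands to nothing for an unset or empty array and therefore stays safe under `set -u`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count the arguments actually passed, to make the expansion visible.
count_args() {
    echo "$#"
}

ROLLOUT_DEMO=(a=1 b=2)   # hypothetical array that is defined
unset OPTIONAL_DEMO      # hypothetical array that is never defined

# ${name[@]+...} yields no words when the array is unset or empty,
# so the optional group simply disappears from the command line.
count_args \
    "${ROLLOUT_DEMO[@]}" \
    ${OPTIONAL_DEMO[@]+"${OPTIONAL_DEMO[@]}"}   # prints 2
```

The same guard can wrap any optional config group, at the cost of a noisier call site; defining the array as empty near the top of the script is the simpler option when the bash version in use is known to tolerate it.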

Comment on lines +98 to +102
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True

critical

These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.

Suggested change
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True

Comment on lines +111 to +116
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1

critical

These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.

Suggested change
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1

"${ACTOR_CONFIG[@]}" \
"${REF_CONFIG[@]}" \
"${ROLLOUT_CONFIG[@]}" \
"${REWARD_MODEL_CONFIG[@]}" \

critical

The REWARD_MODEL_CONFIG variable is used here but is not defined anywhere in the script. Since set -u is active, this will cause the script to fail with an 'unbound variable' error. This parameter does not seem to be required according to the provided documentation. Please remove this line.

Comment on lines 176 to 179
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1

high

These lines in the shell code block have a + prefix, which appears to be a copy-paste artifact from a diff. This makes the example code syntactically incorrect. Please remove the + prefixes to ensure the example is valid.

Suggested change
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1

Comment on lines 217 to 219
actor_rollout_ref.rollout.top_p=${top_p}
actor_rollout_ref.rollout.top_k=${top_k}
actor_rollout_ref.rollout.temperature=${temperature}

high

The variables ${top_p}, ${top_k}, and ${temperature} are used here but are not defined in the script example. This will cause an error if the code is executed. Please define these variables or replace them with example values, such as the ones used in the other training scripts (1.0, -1, 1.0 respectively).

Suggested change
actor_rollout_ref.rollout.top_p=${top_p}
actor_rollout_ref.rollout.top_k=${top_k}
actor_rollout_ref.rollout.temperature=${temperature}
actor_rollout_ref.rollout.top_p=1.0
actor_rollout_ref.rollout.top_k=-1
actor_rollout_ref.rollout.temperature=1.0
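
The other option named in the comment, defining the variables, could look like the sketch below. It keeps the `${top_p}`-style references and gives each variable a default equal to the example values from the review (1.0, -1, 1.0). The array name `SAMPLING_DEMO` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Use the caller's value when one is exported, otherwise fall back to the
# example values named in the review (1.0, -1, 1.0).
top_p=${top_p:-1.0}
top_k=${top_k:--1}          # default is the literal string "-1"
temperature=${temperature:-1.0}

SAMPLING_DEMO=(   # hypothetical array name
    actor_rollout_ref.rollout.top_p="${top_p}"
    actor_rollout_ref.rollout.top_k="${top_k}"
    actor_rollout_ref.rollout.temperature="${temperature}"
)

# Show the resulting overrides, one per line.
printf '%s\n' "${SAMPLING_DEMO[@]}"
```

With this pattern the example stays runnable as written, and a caller can still override a value, for instance with `top_k=50 bash script.sh`.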

Comment on lines 240 to 242
actor_rollout_ref.rollout.val_kwargs.top_p=${top_p}
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}

high

The variables ${top_p}, ${top_k}, and ${temperature} are used here but are not defined in the script example. This will cause an error if the code is executed. Please define these variables or replace them with example values, such as the ones used in the other training scripts (1.0, -1, 1.0 respectively).

Suggested change
actor_rollout_ref.rollout.val_kwargs.top_p=${top_p}
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
actor_rollout_ref.rollout.val_kwargs.top_p=1.0
actor_rollout_ref.rollout.val_kwargs.top_k=-1
actor_rollout_ref.rollout.val_kwargs.temperature=1.0

@psyloy psyloy force-pushed the main branch 2 times, most recently from 7752678 to 692efda Compare January 26, 2026 08:58
@psyloy psyloy requested a review from tardis-key as a code owner January 26, 2026 08:58
@psyloy psyloy changed the title from "[model,doc] feat: add NPU GRPO training scripts for Qwen2.5-32B/Qwen3-30B (Megaton/vLLM backends)" to "[model] feat: add NPU GRPO training scripts for Qwen2.5-32B/Qwen3-30B (Megaton/vLLM backends)" Jan 28, 2026