[model] feat: add NPU GRPO training scripts for Qwen2.5-32B/Qwen3-30B (Megatron/vLLM backends) #4984
Conversation
Code Review
This pull request introduces new GRPO training scripts for Qwen models on Ascend NPUs, along with corresponding documentation. My review focuses on correctness issues in the newly added shell scripts and the documentation file. I have identified several critical syntax errors in the shell scripts that would cause them to fail, likely due to copy-paste mistakes (e.g., stray + prefixes and the use of undefined variables). The documentation's code examples contain similar errors. I have provided specific suggestions to rectify these issues.
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.
Suggested change, replacing:
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
with:
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
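For reference, the corrected lines would normally sit inside a plain bash array of Hydra override strings. A minimal sketch only; the ACTOR_CONFIG array name and the example offload fraction are assumptions inferred from the "${ACTOR_CONFIG[@]}" expansion later in the script, not taken from this diff:

# Hypothetical example; variable value and array name are assumptions.
optimizer_offload_fraction=1.0

ACTOR_CONFIG=(
    actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
    actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
    actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
)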
actor_rollout_ref.actor.megatron.grad_offload=${all_offload}
actor_rollout_ref.actor.megatron.dist_checkpointing_path=${MCORE_MODEL_PATH}
actor_rollout_ref.actor.megatron.use_dist_checkpointing=False
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
This line has a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefix.
Suggested change, replacing:
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
with:
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
| "${ACTOR_CONFIG[@]}" \ | ||
| "${REF_CONFIG[@]}" \ | ||
| "${ROLLOUT_CONFIG[@]}" \ | ||
| "${REWARD_MODEL_CONFIG[@]}" \ |
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.
Suggested change, replacing:
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
with:
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction}
actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
These lines have a + prefix, which is invalid syntax for a bash array assignment. This appears to be a copy-paste error from a diff and will cause the script to fail. Please remove the + prefixes.
Suggested change, replacing:
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
with:
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
| "${ACTOR_CONFIG[@]}" \ | ||
| "${REF_CONFIG[@]}" \ | ||
| "${ROLLOUT_CONFIG[@]}" \ | ||
| "${REWARD_MODEL_CONFIG[@]}" \ |
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
These lines in the shell code block have a + prefix, which appears to be a copy-paste artifact from a diff. This makes the example code syntactically incorrect. Please remove the + prefixes to ensure the example is valid.
Suggested change, replacing:
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
with:
actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
actor_rollout_ref.rollout.top_p=${top_p}
actor_rollout_ref.rollout.top_k=${top_k}
actor_rollout_ref.rollout.temperature=${temperature}
The variables ${top_p}, ${top_k}, and ${temperature} are used here but are not defined in the script example. This will cause an error if the code is executed. Please define these variables or replace them with example values, such as the ones used in the other training scripts (1.0, -1, 1.0 respectively).
Suggested change, replacing:
actor_rollout_ref.rollout.top_p=${top_p}
actor_rollout_ref.rollout.top_k=${top_k}
actor_rollout_ref.rollout.temperature=${temperature}
with:
actor_rollout_ref.rollout.top_p=1.0
actor_rollout_ref.rollout.top_k=-1
actor_rollout_ref.rollout.temperature=1.0
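Alternatively, the documentation example could define the sampling variables before they are referenced. A minimal sketch; the values mirror the other training scripts, and the ROLLOUT_CONFIG array name is an assumption based on the expansion shown earlier in the scripts:

# Example sampling parameters; actual experiment values may differ.
top_p=1.0
top_k=-1
temperature=1.0

ROLLOUT_CONFIG=(
    actor_rollout_ref.rollout.top_p=${top_p}
    actor_rollout_ref.rollout.top_k=${top_k}
    actor_rollout_ref.rollout.temperature=${temperature}
    actor_rollout_ref.rollout.val_kwargs.top_p=${top_p}
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
)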
actor_rollout_ref.rollout.val_kwargs.top_p=${top_p}
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
The variables ${top_p}, ${top_k}, and ${temperature} are used here but are not defined in the script example. This will cause an error if the code is executed. Please define these variables or replace them with example values, such as the ones used in the other training scripts (1.0, -1, 1.0 respectively).
Suggested change, replacing:
actor_rollout_ref.rollout.val_kwargs.top_p=${top_p}
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k}
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature}
with:
actor_rollout_ref.rollout.val_kwargs.top_p=1.0
actor_rollout_ref.rollout.val_kwargs.top_k=-1
actor_rollout_ref.rollout.val_kwargs.temperature=1.0
What does this PR do?
This PR adds GRPO training scripts for Qwen2.5-32B and Qwen3-30B (Megatron + vLLM backends) on the NPU (Ascend) platform, along with corresponding documentation.
Test
Qwen2.5-32B GRPO training with GSM8K:
(screenshot)
Qwen3-30B GRPO training with GSM8K:
(screenshot)
API and Usage Example
# Run Qwen2.5-32B GRPO training (Megatron + vLLM backend) on Ascend NPU
bash ./examples/grpo_trainer/run_qwen2_5-32b_grpo_megatron_vllm_npu.sh

# Run Qwen3MoE-30B GRPO training (Megatron + vLLM backend) on Ascend NPU
bash ./examples/grpo_trainer/run_qwen3moe-30b_grpo_megatron_vllm_npu.sh