Following examples/qwen2_5/README.md step by step, I cannot start training normally. PAI-Megatron-LM-240718/megatron/training/checkpointing.py raises an error at line 855: lm_head has no extra_state. If I instead set strict to False, as suggested in an earlier issue, the training loss starts in the teens, which cannot be right.

I found a workaround: in toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2.5_convertor.sh, change the 1.5B setting tie_option="" to tie_option="--untie-embeddings-and-output-weights", leaving the 1.5B configuration in examples/qwen2_5/run_mcore_qwen.sh unchanged. Training then starts normally, with the loss beginning around 3 and decreasing. I suspect the lm_head handling in toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2_dense_and_moe_gqa.py is buggy.
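For reference, here is a minimal sketch of the tied-vs-untied head handling that would produce exactly these symptoms. This is my own illustration, not the repo's actual converter code; the function and key names are assumptions:

```python
import torch

def convert_output_layer(hf_state: dict, mg_state: dict, untie: bool) -> dict:
    """Illustrative HF -> Megatron-Core conversion of the LM head.

    hf_state / mg_state are flat dicts of tensors; the key names are
    assumptions for illustration, not the converter's real names.
    """
    if untie:
        # Untied: the output projection is saved as its own tensor, so the
        # checkpoint contains output_layer.weight (and, with
        # TransformerEngine modules, a matching output_layer._extra_state)
        # when load_checkpoint() later looks for them.
        mg_state["output_layer.weight"] = hf_state["lm_head.weight"].clone()
    else:
        # Tied: nothing is written for the output layer, since it is expected
        # to reuse the embedding weights. If the training run is then
        # configured with --untie-embeddings-and-output-weights, loading
        # misses the output_layer keys (the "lm_head has no extra_state"
        # error), and strict=False leaves the head randomly initialized,
        # which matches a starting loss in the teens.
        pass
    return mg_state

# Tiny demo with dummy shapes:
hf = {"lm_head.weight": torch.randn(8, 4)}
print(sorted(convert_output_layer(hf, {}, untie=True)))   # ['output_layer.weight']
print(sorted(convert_output_layer(hf, {}, untie=False)))  # []
```

If that is what is happening, flipping tie_option for 1.5B simply makes the conversion match what the training script expects, which would explain why the loss starts around 3 after the change.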
This error occurred with tp=2, pp=2; with tp=1, pp=1 there is no error, but then the model must also not be converted the way described above, otherwise there are problems.
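One way to check is to dump the output-layer and embedding keys from the converted checkpoint's last pipeline stage, which is the rank that owns the LM head under pipeline parallelism. The shard path below is illustrative, not necessarily the layout your run produces:

```python
import torch

# Illustrative shard path for the last PP stage of a tp=2, pp=2 conversion;
# adjust the directory and mp_rank_* name to your checkpoint layout.
ckpt = torch.load(
    "qwen2.5-1.5b-mcore/release/mp_rank_00_001/model_optim_rng.pt",
    map_location="cpu",
    weights_only=False,  # Megatron checkpoints pickle more than tensors
)
for key, val in ckpt["model"].items():
    if "output_layer" in key or "embedding" in key:
        print(key, getattr(val, "shape", type(val).__name__))
```

At tp=1, pp=1 a single shard holds both the embedding and the head, which may be why the mismatch only surfaces once the output layer lives on a separate pipeline stage.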
I am running SFT on the 72B model using the idxmap data format plus sequence packing, but the loss starts at about 5.6 and eventually plateaus around 2 without dropping further, which seems strange.