
qwen2.5 1.5B cannot train properly #386

Open
Traveller2001 opened this issue Nov 24, 2024 · 2 comments

Comments


Traveller2001 commented Nov 24, 2024

Following examples/qwen2_5/README.md step by step, training fails to start.
PAI-Megatron-LM-240718/megatron/training/checkpointing.py raises an error at line 855:
lm_head has no extra_state.
If I set strict to False as suggested in a previous issue, training does start, but the loss begins in the teens, which is clearly wrong.
I found a workaround:
In toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2.5_convertor.sh, change the 1.5B setting tie_option="" to tie_option="--untie-embeddings-and-output-weights", leaving the 1.5B configuration in examples/qwen2_5/run_mcore_qwen.sh unchanged.
With that change, training starts normally and the loss decreases from about 3. I suspect the lm_head handling in toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2_dense_and_moe_gqa.py is buggy.
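The strict=False symptom described above can be illustrated with a minimal PyTorch sketch (toy modules, not the actual Megatron checkpointing code): load_state_dict(strict=False) silently skips keys missing from the checkpoint, leaving those parameters at their random initialization, which is consistent with the loss starting in the teens instead of around 3.

```python
import torch

# Minimal sketch (toy example, not the real Megatron code path): with
# strict=False, load_state_dict ignores keys absent from the checkpoint,
# leaving those parameters randomly initialized instead of raising an error.
src = torch.nn.Linear(8, 8)
dst = torch.nn.Linear(8, 8)

state = src.state_dict()
del state["weight"]  # simulate a key missing from the checkpoint (like lm_head state)

result = dst.load_state_dict(state, strict=False)
print(result.missing_keys)  # the dropped key is reported, but nothing is loaded for it
```

With strict=True the same call would raise a RuntimeError, which is why the workaround "fixes" startup while silently producing an untrained head.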


Traveller2001 commented Nov 24, 2024

The error occurs with tp=2, pp=2; with tp=1, pp=1 there is no error, and in that case the model must not be converted with the workaround above, or problems arise.


yuleiqin commented Dec 2, 2024

I am doing SFT with the 72B model using the idxmap format plus sequence packing, but the loss starts at about 5.6 and then plateaus around 2 without decreasing further, which seems strange.
