
Qwen2.5-7B continued pretraining fails #409

Open
YueXiao1995 opened this issue Dec 29, 2024 · 2 comments

Comments

@YueXiao1995

YueXiao1995 commented Dec 29, 2024

On Alibaba's PAI platform, I am trying to run continued pretraining for Qwen2.5, following Pai-Megatron-Patch/blob/main/examples/qwen2_5/README.md.
Following the documentation, I first ran continued pretraining of the 0.5B model on a single GPU, and that works.
However, after scaling up to 8 GPUs for continued pretraining of the 7B model, the run fails: every GPU's memory is fully occupied, yet when I look up the occupying process by its PID the process no longer exists, so the memory must have been allocated during the run itself.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Full traceback:
[rank5]: Traceback (most recent call last):
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/examples/qwen2_5/../qwen2/pretrain_qwen.py", line 278, in
[rank5]: pretrain(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 266, in pretrain
[rank5]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 600, in setup_model_and_optimizer
[rank5]: model = get_model(model_provider_func, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 516, in get_model
[rank5]: model = [
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 517, in
[rank5]: DDP(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 161, in init
[rank5]: self.buffers = allocate_buffers_for_parameters(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 130, in allocate_buffers_for_parameters
[rank5]: ParamAndGradBuffer(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/param_and_grad_buffer.py", line 366, in init
[rank5]: self.grad_data = torch.zeros(
[rank5]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
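
For reference, the allocator setting suggested at the end of the OOM message can be exported before launching the job; note that it only mitigates fragmentation and does not shrink the ~28 GiB gradient buffer the traceback shows being allocated:

```bash
# Suggested by the PyTorch OOM message itself: use expandable segments to reduce
# allocator fragmentation. This does not reduce the size of Megatron's gradient
# buffer, so it cannot by itself make an under-partitioned 7B model fit on 32 GiB GPUs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```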

Compute is quite expensive, so I would really appreciate it if someone could take a look at what is going wrong here.

@lostkevin
Contributor

Hi, did you perhaps forget to adjust the parallelism configuration?

@YueXiao1995
Author

YueXiao1995 commented Jan 6, 2025

Hi, did you perhaps forget to adjust the parallelism configuration?

You're right, I hadn't adjusted it, thank you very much. I set the pipeline parallel degree (PP) to 8 in both hf2mcore_qwen2.5_convertor.sh and run_mcore_qwen.sh, then re-ran the model conversion and the continued pretraining, and both completed successfully.
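
For reference, a minimal sketch of the parallel layout after that change, assuming the two scripts ultimately pass the standard Megatron-LM parallelism flags through to pretrain_qwen.py (the exact variable names inside run_mcore_qwen.sh may differ):

```bash
# Hedged sketch, not the actual script contents: pipeline parallelism raised to 8
# so the 7B model's parameters, gradients and optimizer state are partitioned
# across the 8 GPUs instead of being replicated on each one.
TP=1   # tensor model parallel size
PP=8   # pipeline model parallel size

# Standard Megatron-LM flags these settings map to when launching training:
PARALLEL_ARGS=(
    --tensor-model-parallel-size "${TP}"
    --pipeline-model-parallel-size "${PP}"
)
```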

But now I have run into a new problem: I want to convert the trained Megatron-Core checkpoint back to HuggingFace format, but the conversion fails with "not support yet", specifically at line 333 of Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2_dense_and_moe_gqa.py.
Looking at the code just before that line, there is a conditional check on the parallelism settings, and the configuration I am using (tensor_model_parallel_size = 1, pipeline_model_parallel_size = 8, expert_model_parallel_size = 1) happens to fall into the unsupported case. What should I do in this situation?
