
Qwen2.5-7B continued pretraining fails #409

Open
YueXiao1995 opened this issue Dec 29, 2024 · 2 comments

Comments

@YueXiao1995

YueXiao1995 commented Dec 29, 2024

On Alibaba's PAI platform, I am trying to run continued pretraining for Qwen2.5, following Pai-Megatron-Patch/blob/main/examples/qwen2_5/README.md.
Following the documentation, I first ran continued pretraining of the 0.5B model on a single GPU, and that works.
However, after scaling up to 8 GPUs for continued pretraining of the 7B model, the run fails: every GPU's memory is fully occupied, yet when I look up the occupying process by its PID the process no longer exists, so the memory must have been allocated during the run itself.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Full traceback:
[rank5]: Traceback (most recent call last):
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/examples/qwen2_5/../qwen2/pretrain_qwen.py", line 278, in
[rank5]: pretrain(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 266, in pretrain
[rank5]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 600, in setup_model_and_optimizer
[rank5]: model = get_model(model_provider_func, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 516, in get_model
[rank5]: model = [
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 517, in
[rank5]: DDP(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 161, in init
[rank5]: self.buffers = allocate_buffers_for_parameters(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 130, in allocate_buffers_for_parameters
[rank5]: ParamAndGradBuffer(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/param_and_grad_buffer.py", line 366, in init
[rank5]: self.grad_data = torch.zeros(
[rank5]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
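
For reference, the allocator setting suggested at the end of the OOM message can be exported before launching the job; note that it only mitigates fragmentation and does not shrink the ~28 GiB gradient buffer the traceback shows being allocated:

```bash
# Suggested by the PyTorch OOM message itself: use expandable segments to reduce
# allocator fragmentation. This does not reduce the size of Megatron's gradient
# buffer, so it cannot by itself make an under-partitioned 7B model fit on 32 GiB GPUs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```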

Compute is quite expensive, so I would really appreciate it if someone could take a look at what is going wrong here.

@lostkevin
Contributor

Hi, did you perhaps forget to adjust the parallelism configuration?

@YueXiao1995
Author

YueXiao1995 commented Jan 6, 2025

Hi, did you perhaps forget to adjust the parallelism configuration?

You're right, I hadn't adjusted it, thank you very much. I set the pipeline parallel degree (PP) to 8 in both hf2mcore_qwen2.5_convertor.sh and run_mcore_qwen.sh, then re-ran the model conversion and the continued pretraining, and both completed successfully.
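
For reference, a minimal sketch of the parallel layout after that change, assuming the two scripts ultimately pass the standard Megatron-LM parallelism flags through to pretrain_qwen.py (the exact variable names inside run_mcore_qwen.sh may differ):

```bash
# Hedged sketch, not the actual script contents: pipeline parallelism raised to 8
# so the 7B model's parameters, gradients and optimizer state are partitioned
# across the 8 GPUs instead of being replicated on each one.
TP=1   # tensor model parallel size
PP=8   # pipeline model parallel size

# Standard Megatron-LM flags these settings map to when launching training:
PARALLEL_ARGS=(
    --tensor-model-parallel-size "${TP}"
    --pipeline-model-parallel-size "${PP}"
)
```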

But now I have run into a new problem: I want to convert the trained Megatron-Core checkpoint back to HuggingFace format, but the conversion fails with "not support yet", specifically at line 333 of Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen/hf2mcore_qwen2_dense_and_moe_gqa.py.
Looking at the code just before that line, there is a conditional check on the parallelism settings, and the configuration I am using (tensor_model_parallel_size = 1, pipeline_model_parallel_size = 8, expert_model_parallel_size = 1) happens to fall into the unsupported case. What should I do in this situation?
