On Alibaba's PAI platform I am attempting continued pretraining of Qwen-2.5, following Pai-Megatron-Patch/blob/main/examples/qwen2_5/README.md.
Following the docs, I first ran continued pretraining of the 0.5B model on a single GPU, and that worked.
But after scaling up to 8 GPUs and moving to continued pretraining of the 7B model, the run fails: the memory on every card fills up, yet looking up the occupying process by its PID turns up nothing, so the memory must be allocated by the training code itself.
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
More of the error output:
[rank5]: Traceback (most recent call last):
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/examples/qwen2_5/../qwen2/pretrain_qwen.py", line 278, in <module>
[rank5]: pretrain(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 266, in pretrain
[rank5]: model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 600, in setup_model_and_optimizer
[rank5]: model = get_model(model_provider_func, model_type)
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 516, in get_model
[rank5]: model = [
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/training/training.py", line 517, in <listcomp>
[rank5]: DDP(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 161, in __init__
[rank5]: self.buffers = allocate_buffers_for_parameters(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/distributed_data_parallel.py", line 130, in allocate_buffers_for_parameters
[rank5]: ParamAndGradBuffer(
[rank5]: File "/mnt/workspace/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/distributed/param_and_grad_buffer.py", line 366, in __init__
[rank5]: self.grad_data = torch.zeros(
[rank5]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.37 GiB. GPU 5 has a total capacity of 31.75 GiB of which 2.65 GiB is free. Process 13924 has 29.10 GiB memory in use. Of the allocated memory 28.42 GiB is allocated by PyTorch, and 17.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
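For what it's worth, the size of the failed allocation looks consistent with the gradient buffer that ParamAndGradBuffer would allocate for a fully replicated 7B model. This is only a back-of-the-envelope sketch, assuming Qwen2.5-7B has roughly 7.6e9 parameters, no tensor or pipeline parallelism (so each rank holds the whole model), and a gradient buffer kept in fp32 (4 bytes per element, as Megatron does when gradients are accumulated/all-reduced in fp32):

```python
# Rough estimate of the grad buffer size that fails to allocate.
# Assumptions (not from the log): ~7.6e9 parameters for Qwen2.5-7B, the whole
# model replicated on every GPU (TP=PP=1), and an fp32 gradient buffer.
num_params = 7.6e9
bytes_per_grad_element = 4  # fp32
grad_buffer_gib = num_params * bytes_per_grad_element / 2**30
print(f"estimated grad buffer: {grad_buffer_gib:.2f} GiB")  # ~28.3 GiB vs. 28.37 GiB in the log
```

If that rough estimate holds, the fp32 gradient buffer alone is already close to the card's full 31.75 GiB capacity, before counting weights, optimizer state, or activations.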
Compute resources are expensive, so I would really appreciate it if someone could take a look at what the problem is.