GPU memory keeps increasing during fine-tuning #215

Open

Liu98C opened this issue Jan 18, 2025 · 0 comments

Liu98C commented Jan 18, 2025

```
Traceback (most recent call last):
File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train_mem.py", line 13, in
train()
File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train.py", line 1074, in train
trainer.train()
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
loss = self.module(*inputs, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
return self.base_model(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/projectDir/Video-LLaVA/videollava/model/language_model/llava_llama.py", line 87, in forward
return super().forward(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 838, in forward
loss = loss_fct(shift_logits, shift_labels)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.67 GiB (GPU 0; 47.53 GiB total capacity; 35.22 GiB already allocated; 2.94 GiB free; 44.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%|▏ | 23/6931 [06:15<31:20:15, 16.33s/it]
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066686
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066687
[2025-01-18 16:31:42,468] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066688
[2025-01-18 16:31:43,263] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066689
[2025-01-18 16:31:44,017] [ERROR] [launch.py:325:sigkill_handler] ['/home/bit118/anaconda3/envs/videollava/bin/python3.10', '-u', 'videollava/train/train_mem.py', '--local_rank=3', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './scripts/zero2_offload.json', '--model_name_or_path', '/home/bit118/data/modelDir/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/home/bit118/projectDir/VL-VAD/train_json/llava_image_tune.json', '/home/bit118/projectDir/VL-VAD/train_json/nlp_tune.json', '--image_folder', '/home/bit118/data/datasetDir/llava_dataset', '--image_tower', '/home/bit118/modelDir/LanguageBind_Image', '--video_tower', '/home/bit118/modelDir/LanguageBind_Video_merge', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/bit118/modelDir/Video-LLaVA-Pretrain-7B/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/videollava-7b-lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '24', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--tokenizer_model_max_length', '3072', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'tensorboard', '--cache_dir', './cache_dir'] exits with return code = 1
```
Why does GPU memory usage keep increasing during training until it runs out of memory? How can this be solved?
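For reference, the OOM message itself suggests setting `PYTORCH_CUDA_ALLOC_CONF` with `max_split_size_mb` to reduce allocator fragmentation. Below is a minimal sketch of trying that, plus logging allocator stats to check whether memory actually grows step over step; the 128 MB split size and the helper name `log_cuda_memory` are illustrative assumptions, not part of this repo:

```python
import os

# Must be set before the CUDA context is created (safest: before importing torch,
# or exported in the shell before the deepspeed launch command above).
# The 128 MB value is only an illustrative starting point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch


def log_cuda_memory(step: int) -> None:
    """Print allocated vs. reserved GPU memory to see if usage really grows each step."""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")
```

If reserved memory stays far above allocated memory, the problem is more likely fragmentation than a true leak; otherwise, reducing `--per_device_train_batch_size` or raising `--gradient_accumulation_steps` may be needed.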

Liu98C changed the title from "GPU memory keeps increasing during training" to "GPU memory keeps increasing during fine-tuning" on Jan 18, 2025