GPU memory keeps increasing during fine-tuning #215

Open

Liu98C opened this issue Jan 18, 2025 · 0 comments

Liu98C commented Jan 18, 2025

```
Traceback (most recent call last):
File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train_mem.py", line 13, in
train()
File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train.py", line 1074, in train
trainer.train()
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
loss = self.module(*inputs, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
return self.base_model(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/projectDir/Video-LLaVA/videollava/model/language_model/llava_llama.py", line 87, in forward
return super().forward(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 838, in forward
loss = loss_fct(shift_logits, shift_labels)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.67 GiB (GPU 0; 47.53 GiB total capacity; 35.22 GiB already allocated; 2.94 GiB free; 44.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%|▏ | 23/6931 [06:15<31:20:15, 16.33s/it]
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066686
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066687
[2025-01-18 16:31:42,468] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066688
[2025-01-18 16:31:43,263] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066689
[2025-01-18 16:31:44,017] [ERROR] [launch.py:325:sigkill_handler] ['/home/bit118/anaconda3/envs/videollava/bin/python3.10', '-u', 'videollava/train/train_mem.py', '--local_rank=3', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './scripts/zero2_offload.json', '--model_name_or_path', '/home/bit118/data/modelDir/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/home/bit118/projectDir/VL-VAD/train_json/llava_image_tune.json', '/home/bit118/projectDir/VL-VAD/train_json/nlp_tune.json', '--image_folder', '/home/bit118/data/datasetDir/llava_dataset', '--image_tower', '/home/bit118/modelDir/LanguageBind_Image', '--video_tower', '/home/bit118/modelDir/LanguageBind_Video_merge', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/bit118/modelDir/Video-LLaVA-Pretrain-7B/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/videollava-7b-lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '24', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--tokenizer_model_max_length', '3072', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'tensorboard', '--cache_dir', './cache_dir'] exits with return code = 1
```
Why does GPU memory usage keep increasing during training until it runs out of memory? How can this be solved?
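For reference, the OOM message itself suggests setting `PYTORCH_CUDA_ALLOC_CONF` with `max_split_size_mb` to reduce allocator fragmentation. Below is a minimal sketch of trying that, plus logging allocator stats to check whether memory actually grows step over step; the 128 MB split size and the helper name `log_cuda_memory` are illustrative assumptions, not part of this repo:

```python
import os

# Must be set before the CUDA context is created (safest: before importing torch,
# or exported in the shell before the deepspeed launch command above).
# The 128 MB value is only an illustrative starting point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch


def log_cuda_memory(step: int) -> None:
    """Print allocated vs. reserved GPU memory to see if usage really grows each step."""
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")
```

If reserved memory stays far above allocated memory, the problem is more likely fragmentation than a true leak; otherwise, reducing `--per_device_train_batch_size` or raising `--gradient_accumulation_steps` may be needed.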

Liu98C changed the title from "GPU memory keeps increasing during training" to "GPU memory keeps increasing during fine-tuning" on Jan 18, 2025