```
Traceback (most recent call last):
  File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train_mem.py", line 13, in <module>
train()
File "/home/bit118/projectDir/Video-LLaVA/videollava/train/train.py", line 1074, in train
trainer.train()
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
loss = self.compute_loss(model, inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
outputs = model(**inputs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1899, in forward
loss = self.module(*inputs, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
return self.base_model(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/projectDir/Video-LLaVA/videollava/model/language_model/llava_llama.py", line 87, in forward
return super().forward(
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 838, in forward
loss = loss_fct(shift_logits, shift_labels)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/bit118/anaconda3/envs/videollava/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.67 GiB (GPU 0; 47.53 GiB total capacity; 35.22 GiB already allocated; 2.94 GiB free; 44.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0%|▏ | 23/6931 [06:15<31:20:15, 16.33s/it]
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066686
[2025-01-18 16:31:41,670] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066687
[2025-01-18 16:31:42,468] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066688
[2025-01-18 16:31:43,263] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 3066689
[2025-01-18 16:31:44,017] [ERROR] [launch.py:325:sigkill_handler] ['/home/bit118/anaconda3/envs/videollava/bin/python3.10', '-u', 'videollava/train/train_mem.py', '--local_rank=3', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './scripts/zero2_offload.json', '--model_name_or_path', '/home/bit118/data/modelDir/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/home/bit118/projectDir/VL-VAD/train_json/llava_image_tune.json', '/home/bit118/projectDir/VL-VAD/train_json/nlp_tune.json', '--image_folder', '/home/bit118/data/datasetDir/llava_dataset', '--image_tower', '/home/bit118/modelDir/LanguageBind_Image', '--video_tower', '/home/bit118/modelDir/LanguageBind_Video_merge', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/bit118/modelDir/Video-LLaVA-Pretrain-7B/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/videollava-7b-lora', '--num_train_epochs', '1', '--per_device_train_batch_size', '24', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--tokenizer_model_max_length', '3072', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'tensorboard', '--cache_dir', './cache_dir'] exits with return code = 1
```
Why does GPU memory usage keep growing during training until it finally runs out of memory? How can this be solved?
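For reference, the OOM message itself points at one mitigation: when reserved memory is much larger than allocated memory, fragmentation may be the problem, and `max_split_size_mb` can be set through `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of that (the value 128 is only illustrative, not a tuned recommendation), assuming it runs before the first CUDA allocation:

```python
# Sketch of the mitigation suggested by the OOM message: cap the CUDA
# caching allocator's split size to reduce fragmentation. Must be set
# before the first CUDA allocation, e.g. at the very top of
# videollava/train/train_mem.py, or exported in the launch script.
import os

# 128 is an illustrative value, not a tested setting; see the PyTorch
# "Memory Management" / PYTORCH_CUDA_ALLOC_CONF documentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

If memory still climbs as longer samples appear, the usual next step with this launch command would be to lower `--per_device_train_batch_size` (currently 24) and raise `--gradient_accumulation_steps` to keep the effective batch size, though that is a general suggestion rather than anything specific to Video-LLaVA.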