How to fix it? training/cogvideox_text_to_video_lora.py FAILED #25

Open
2 tasks done
D-Mad opened this issue Oct 11, 2024 · 2 comments

Comments

D-Mad commented Oct 11, 2024

System Info / 系統信息

cuda11.8
x2 3090
linux ubuntu 22.04 lts
pytorch2.4

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/dev_ml/cogvideox-factory/wandb/offline-run-20241011_154425-t76nveyh
wandb: Find logs at: wandb/offline-run-20241011_154425-t76nveyh/logs
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] Function, Runtimes (s)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
W1011 15:45:01.515000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 177223 closing signal SIGTERM
E1011 15:45:02.282000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 177222) of binary: /home/dev_ml/cogvideox-factory/venv/bin/python3.10
Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

training/cogvideox_text_to_video_lora.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:45:01
host : W-ML-01
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 177222)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected behavior / 期待表现

How can I fix it?

D-Mad (Author) commented Oct 11, 2024

./train_text_to_video_lora.sh
Running command: accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids 0,1 training/cogvideox_text_to_video_lora.py --pretrained_model_name_or_path THUDM/CogVideoX-5b --data_root /home/dev_ml/cogvideox-factory/video-dataset-disney --caption_column prompt.txt --video_column videos.txt --id_token BW_STYLE --height_buckets 480 --width_buckets 720 --frame_buckets 49 --dataloader_num_workers 8 --pin_memory --validation_prompt "BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" --validation_prompt_separator ::: --num_validation_videos 1 --validation_epochs 10 --seed 42 --rank 128 --lora_alpha 128 --mixed_precision bf16 --output_dir /home/dev_ml/cogvideox-factory/cogvideox-lora__optimizer_adam__steps_3000__lr-schedule_cosine_with_restarts__learning-rate_1e-4/ --max_num_frames 49 --train_batch_size 1 --max_train_steps 3000 --checkpointing_steps 1000 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 1e-4 --lr_scheduler cosine_with_restarts --lr_warmup_steps 400 --lr_num_cycles 1 --enable_slicing --enable_tiling --optimizer adam --beta1 0.9 --beta2 0.95 --weight_decay 0.001 --max_grad_norm 1.0 --allow_tf32 --enable_model_cpu_offload --report_to wandb --nccl_timeout 1800
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8858.09it/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10686.12it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.47s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.46s/it]
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3232.60it/s]
{'use_learned_positional_embeddings'} was not found in config. Values will be initialized to default values.
Fetching 2 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3766.78it/s]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
===== Memory before training =====
memory_allocated=20.153 GB
max_memory_allocated=20.153 GB
max_memory_reserved=20.514 GB
***** Running training *****
Num trainable parameters = 132120576
Num examples = 69
Num epochs = 44
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient accumulation steps = 1
Total optimization steps = 3000
Steps: 0%| | 0/3000 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 924, in
main(args)
File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 636, in main
latent_dist = vae.encode(videos).latent_dist
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1222, in encode
h = self._encode(x)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1181, in _encode
return self.tiled_encode(x)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1348, in tiled_encode
tile, conv_cache = self.encoder(tile, conv_cache=conv_cache)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 799, in forward
hidden_states, new_conv_cache[conv_cache_key] = down_block(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 431, in forward
hidden_states, new_conv_cache[conv_cache_key] = resnet(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 296, in forward
hidden_states, new_conv_cache["conv1"] = self.conv1(hidden_states, conv_cache=conv_cache.get("conv1"))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 134, in forward
inputs = F.pad(inputs, padding_2d, mode="constant", value=0)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 4552, in pad
return torch._C._nn.pad(input, pad, mode, value)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 139.50 MiB is free. Process 144849 has 337.11 MiB memory in use. Including non-PyTorch memory, this process has 22.83 GiB memory in use. Of the allocated memory 21.49 GiB is allocated by PyTorch, and 907.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 924, in <module>
[rank0]: main(args)
[rank0]: File "/home/dev_ml/cogvideox-factory/training/cogvideox_text_to_video_lora.py", line 636, in main
[rank0]: latent_dist = vae.encode(videos).latent_dist
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
[rank0]: return method(self, *args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1222, in encode
[rank0]: h = self._encode(x)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1181, in _encode
[rank0]: return self.tiled_encode(x)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1348, in tiled_encode
[rank0]: tile, conv_cache = self.encoder(tile, conv_cache=conv_cache)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 799, in forward
[rank0]: hidden_states, new_conv_cache[conv_cache_key] = down_block(
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 431, in forward
[rank0]: hidden_states, new_conv_cache[conv_cache_key] = resnet(
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 296, in forward
[rank0]: hidden_states, new_conv_cache["conv1"] = self.conv1(hidden_states, conv_cache=conv_cache.get("conv1"))
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 134, in forward
[rank0]: inputs = F.pad(inputs, padding_2d, mode="constant", value=0)
[rank0]: File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 4552, in pad
[rank0]: return torch._C._nn.pad(input, pad, mode, value)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU 0 has a total capacity of 23.68 GiB of which 139.50 MiB is free. Process 144849 has 337.11 MiB memory in use. Including non-PyTorch memory, this process has 22.83 GiB memory in use. Of the allocated memory 21.49 GiB is allocated by PyTorch, and 907.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/dev_ml/cogvideox-factory/wandb/offline-run-20241011_155115-v4hkz2vc
wandb: Find logs at: wandb/offline-run-20241011_155115-v4hkz2vc/logs
[rank0]:I1011 15:51:47.497000 139958949089920 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
[rank0]:I1011 15:51:47.497000 139958949089920 torch/_dynamo/utils.py:335] Function, Runtimes (s)
[rank0]:V1011 15:51:47.498000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.498000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:51:47.498000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.498000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:51:47.499000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:51:47.500000 139958949089920 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
W1011 15:51:49.320000 135937568105088 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 186206 closing signal SIGTERM
E1011 15:51:50.136000 135937568105088 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 186205) of binary: /home/dev_ml/cogvideox-factory/venv/bin/python3.10
Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

training/cogvideox_text_to_video_lora.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:51:49
host : W-ML-01
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 186205)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Added more terminal output above for context.
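
A side note for readers hitting the same trace: the OOM message above already points at one allocator-level mitigation. It only helps when a lot of memory is reserved by PyTorch but unallocated, and on a 24 GB card it is unlikely to be enough on its own, but it is harmless to try alongside the fix described in the next comment. A minimal sketch:

# Allocator setting suggested by the error message itself; reduces fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
./train_text_to_video_lora.sh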

a-r-r-o-w (Owner) commented

Have you run prepare_dataset.py before running training? If you don't run it, it is not possible to train in under 24 GB, because the training process ends up loading the text encoder and VAE, and VAE encode/decode can take an additional ~5 GB on top of the model weights.

If you prepare the dataset by precomputing latents and prompt embeddings first, you should be able to reproduce the memory numbers we report.
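
For concreteness, a minimal sketch of that precompute-then-train workflow, reusing the dataset paths from the launch command above. The exact prepare_dataset.py argument names and output layout are assumptions here and may differ between revisions of the repo, so confirm them with python training/prepare_dataset.py --help before running.

# Step 1 (argument names are assumptions; check --help): precompute VAE latents
# and T5 prompt embeddings once, so neither model is needed during LoRA training.
python training/prepare_dataset.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-5b \
  --data_root /home/dev_ml/cogvideox-factory/video-dataset-disney \
  --caption_column prompt.txt \
  --video_column videos.txt \
  --output_dir /home/dev_ml/cogvideox-factory/video-dataset-disney-precomputed

# Step 2: rerun the LoRA script with --data_root pointed at the precomputed
# directory, keeping the rest of the launch command unchanged. Whether an extra
# flag is needed to pick up precomputed tensors depends on the script version;
# see the repo README.
./train_text_to_video_lora.sh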
