-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
how to fix it ? training/cogvideox_text_to_video_lora.py FAILED #25
Comments
./train_text_to_video_lora.sh
|
Have you run If you prepare the dataset by precomputing latents and prompt embeddings first, you should be able to reproduce the memory numbers we report. |
System Info / 系統信息
cuda11.8
x2 3090
linux ubuntu 22.04 lts
pytorch2.4
Information / 问题信息
Reproduction / 复现过程
andb: You can sync this run to the cloud by running:
wandb: wandb sync /home/dev_ml/cogvideox-factory/wandb/offline-run-20241011_154425-t76nveyh
wandb: Find logs at: wandb/offline-run-20241011_154425-t76nveyh/logs
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] TorchDynamo compilation metrics:
[rank0]:I1011 15:44:57.956000 124307873129088 torch/_dynamo/utils.py:335] Function, Runtimes (s)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats constrain_symbol_range: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.956000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats evaluate_expr: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _simplify_floor_div: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_guard_rel: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _find: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats has_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats size_hint: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats simplify: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _update_divisible: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats replace: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.957000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats _maybe_evaluate_static: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_implications: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats get_axioms: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats safe_expand: CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)
[rank0]:V1011 15:44:57.958000 124307873129088 torch/fx/experimental/symbolic_shapes.py:116] lru_cache_stats uninteresting_files: CacheInfo(hits=0, misses=0, maxsize=None, currsize=0)
W1011 15:45:01.515000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 177223 closing signal SIGTERM
E1011 15:45:02.282000 129677780091520 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 177222) of binary: /home/dev_ml/cogvideox-factory/venv/bin/python3.10
Traceback (most recent call last):
File "/home/dev_ml/cogvideox-factory/venv/bin/accelerate", line 8, in
sys.exit(main())
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dev_ml/cogvideox-factory/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
training/cogvideox_text_to_video_lora.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:45:01
host : W-ML-01
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 177222)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Expected behavior / 期待表现
how to fit it ?
The text was updated successfully, but these errors were encountered: