Environment

Python 3.10.15
Torch 2.5.1
GPU: A6000 (48 GB) × 4
CUDA Version 12.2
DeepSpeed 0.15.4
Accelerate 1.1.1
Description
While running `train_control.py`, the process fails during the final checkpoint-saving step. The logs indicate that an NCCL `ALLREDUCE` operation timed out after approximately 30 minutes (`Timeout(ms)=1800000`), and the process is then terminated with a `c10::DistBackendError`.
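For reference, below is a minimal sketch (not the actual `train_control.py` code) of one possible mitigation: raising the NCCL process-group timeout through Accelerate's `InitProcessGroupKwargs`, so that a slow DeepSpeed checkpoint save does not trip the 30-minute watchdog while the other ranks wait at the collective. The 3-hour value is only an illustrative assumption.

```python
# Hypothetical sketch (not the repo's code): extend the NCCL collective
# timeout beyond the 30-minute default so that a long checkpoint save on
# one rank does not cause the watchdog to kill the job.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Timeout value is an assumption; it should exceed the slowest expected save.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=3))

accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```

This only buys more time for the save; whether the root cause is the save duration itself or one rank entering the save path while the others block on the `ALLREDUCE` is a separate question.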
Error Logs
```
Steps: 100%|████████████████████████████████████████████| 6928/6928 [37:28<00:00, 11.45s/it, lr=2e-5, step_loss=0.125]
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving current state to output_dir/checkpoint-6928
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[rank0]:[E1202 16:33:31.475880828 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
[rank0]:[E1202 16:33:31.828915902 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[2024-12-02 16:33:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
[rank0]:[E1202 16:33:32.374203239 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[rank0]:[E1202 16:33:32.374226255 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1202 16:33:32.374231830 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1202 16:33:32.388427390 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f6e7e5e771b in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
```