Failed to Save Checkpoint at Final Training Step. #89

Open

WuNvRenQiaoDaiMaZiRanShen opened this issue Dec 2, 2024 · 1 comment

@WuNvRenQiaoDaiMaZiRanShen
  • Version

Python 3.10.15
Torch 2.5.1
GPU: 4× A6000 (48 GB each)
CUDA Version 12.2
DeepSpeed 0.15.4
Accelerate 1.1.1

  • Description

While running train_control.py, training completes all 6928 steps, but the process fails while saving the final checkpoint. The logs show an NCCL ALLREDUCE operation timing out after roughly 30 minutes (the default 1800000 ms collective timeout), after which the watchdog terminates the process with a c10::DistBackendError.
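For context, the 1800000 ms in the trace is NCCL's default 30-minute collective timeout, so any rank that blocks longer than that at the final allreduce takes the whole job down. Below is a minimal sketch of one common workaround: raising the process-group timeout through Accelerate's InitProcessGroupKwargs and synchronizing ranks before the save. How train_control.py actually constructs its Accelerator and names its output directory is an assumption on my part, not the script's real code.

```python
# Hedged sketch: raise the NCCL collective timeout so a slow final
# checkpoint save does not trip the default 30-minute watchdog.
# The Accelerator construction below is illustrative only; train_control.py
# may wire this up differently.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Allow collectives to wait up to 2 hours before the watchdog fires.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))

accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

# ... training loop ...

# Make sure every rank reaches the save point before checkpointing, so no
# rank is left waiting at a collective while rank 0 writes files to disk.
accelerator.wait_for_everyone()
accelerator.save_state("output_dir/checkpoint-6928")
```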

  • Error Logs

```
Steps: 100%|████████████████████████████████████████████| 6928/6928 [37:28<00:00, 11.45s/it, lr=2e-5, step_loss=0.125]
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving current state to output_dir/checkpoint-6928
12/02/2024 16:03:31 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[rank0]:[E1202 16:33:31.475880828 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
[rank0]:[E1202 16:33:31.828915902 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[2024-12-02 16:33:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
[rank0]:[E1202 16:33:32.374203239 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 140, last enqueued NCCL work: 140, last completed NCCL work: 139.
[rank0]:[E1202 16:33:32.374226255 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1202 16:33:32.374231830 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1202 16:33:32.388427390 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=140, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6e7e971772 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6e7e978bb3 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6e7e97a61d in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6e7d65e446 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f6e7e5e771b in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f6ec73075c0 in /data2/akko/anaconda3/envs/cogvideox/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x8609 (0x7f6eca106609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f6ec9ed1353 in /lib/x86_64-linux-gnu/libc.so.6)
```

@bubbliiiing
Collaborator

It looks like the final weights were saved successfully; the problem seems to be with the program shutting down at the very end of training.
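If the failure really happened only during teardown, the ZeRO shards written to output_dir/checkpoint-6928 should still be usable. Below is a minimal sketch of verifying that with DeepSpeed's zero_to_fp32 utility; the checkpoint layout (a "pytorch_model" tag inside the checkpoint folder, matching the "Checkpoint pytorch_model is about to be saved!" line in the log) is assumed from common Accelerate + DeepSpeed defaults and may differ in this repo.

```python
# Hedged sketch: confirm the final DeepSpeed checkpoint is intact by
# consolidating the ZeRO shards into a single fp32 state dict.
# The directory and tag below are assumptions based on the log output and
# Accelerate + DeepSpeed defaults, not verified paths from train_control.py.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "output_dir/checkpoint-6928"

# Gathers the partitioned parameter/optimizer shards and returns a
# consolidated fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag="pytorch_model")

print(f"Recovered {len(state_dict)} tensors from {checkpoint_dir}")

# Optionally write a plain PyTorch checkpoint for downstream use.
torch.save(state_dict, "pytorch_model_fp32.bin")
```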
