Stuck on training #63

aaghawaheed · 2022-12-26T08:07:08Z

When I start training on multiple GPUs, it gets stuck. You can see this in the screenshot.

aaghawaheed · 2022-12-26T08:14:59Z

sh configs/r50_motr_train.sh
/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

| distributed init (rank 0): env://
| distributed init (rank 1): env://

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809347 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809351 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 118400) of binary: /home/user/anaconda3/envs/motr2/bin/python3
Traceback (most recent call last):
File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in
main()
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2022-12-26_18:00:48
host : user-System-Product-Name
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 118401)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 118401

Root Cause (first observed failure):
[0]:
time : 2022-12-26_18:00:48
host : user-System-Product-Name
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 118400)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 118400
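
The FutureWarning at the top of this log recommends reading the local rank from os.environ['LOCAL_RANK'] instead of a --local_rank argument. A minimal sketch of that change, assuming the training script currently takes --local_rank as a command-line argument; the helper name get_local_rank is hypothetical and is not part of this repository. This only addresses the deprecation warning and is not a confirmed fix for the timeout.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    Hypothetical helper: torchrun sets LOCAL_RANK for each worker, while the
    deprecated torch.distributed.launch passes --local_rank on the command line.
    """
    environ = os.environ if environ is None else environ
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])

    # Fall back to the legacy argument; ignore any other training arguments.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank
```

With a helper like this, the same script can be started either through torchrun or through the older launcher without changing its arguments.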

aaghawaheed · 2022-12-27T07:27:13Z

The process gets stuck at torch.distributed.barrier().

Here is my environment information:

PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090

Nvidia driver version: 525.60.11
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.23.4 py310hd5efca6_0
[conda] numpy-base 1.23.4 py310h8e6c178_0
[conda] pytorch 1.13.0 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h67b0de4_1 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.13.0 py310_cu117 pytorch
[conda] torchvision 0.14.0 py310_cu117 pytorch
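
A hang at torch.distributed.barrier() followed by an ALLREDUCE timeout is usually easier to narrow down with NCCL's own logging enabled. The sketch below shows one possible way to rerun the launch script from the log (configs/r50_motr_train.sh) with diagnostic settings. NCCL_DEBUG and NCCL_P2P_DISABLE are standard NCCL environment variables; the helper names nccl_debug_env and run_training are hypothetical. Disabling peer-to-peer transfers is only a diagnostic experiment here, and nothing in this thread confirms it as the cause.

```python
import os
import subprocess


def nccl_debug_env(base=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics switched on.

    Hypothetical helper. NCCL_DEBUG=INFO makes NCCL print its setup and
    transport choices; NCCL_P2P_DISABLE=1 turns off direct GPU-to-GPU
    transfers so a P2P problem can be ruled in or out.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def run_training(script="configs/r50_motr_train.sh", disable_p2p=False):
    """Run the launch script with the diagnostic environment (assumed path)."""
    env = nccl_debug_env(disable_p2p=disable_p2p)
    return subprocess.run(["sh", script], env=env, check=False).returncode


# Example (not executed here):
#   run_training()                   # same run, with verbose NCCL logs
#   run_training(disable_p2p=True)   # second run, P2P transfers disabled
```

If the run completes with P2P disabled but hangs without it, that points toward the GPU interconnect or driver rather than the training code; if both runs hang, the NCCL_DEBUG output shows how far initialization got on each rank.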
