Stuck on training #63
sh configs/r50_motr_train.sh
warnings.warn(
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809347 milliseconds before timing out.
main.py FAILED
Failures:
The process is stuck at torch.distributed.barrier(). Here is my environment information:
PyTorch version: 1.13.0
OS: Ubuntu 20.04.5 LTS (x86_64)
Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Nvidia driver version: 525.60.11
Versions of relevant libraries:
When I start training on multiple GPUs, it gets stuck, as you can see in the screenshot.