I just cloned your repo and when I'm launching the command:

CUDA_VISIBLE_DEVICES=2,3,4,5 python imagenet.py -a mobilenetv2 -d /path/to/dataset/ImageNet2012/ --epochs 150 --lr-decay cos --lr 0.05 --wd 4e-5 -c checkpoints --width-mult 1 --input-size 224 -j 12

it gets stuck at this point:

=> creating model 'mobilenetv2'
Epoch: [1 | 150]
Processing
<Ctrl+C pressed after 10 min of nothing happening:>
^CTraceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 224, in main
    train_loss, train_acc = train(train_loader, train_loader_len, model, criterion, optimizer, epoch)
  File "imagenet.py", line 271, in train
    for i, (input, target) in enumerate(train_loader):
  File "/home/michael/mobilenetv2.pytorch/utils/dataloaders.py", line 190, in prefetched_loader
    for next_input, next_target in loader:
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 804, in __next__
    idx, data = self._get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _get_data
    success, data = self._try_get_data()
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/home/michael/miniconda2/envs/pt/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Nothing is happening at this point. nvidia-smi shows that a single GPU consumes ~500 MB of memory, and CPU cores are ~60% busy, but it's not clear what they are doing. I waited for 10 minutes before aborting. I also tried it on a single GPU - same issue.
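A minimal way to isolate the hang is to read a single batch with num_workers=0, so everything runs in the main process; if that also stalls, the problem is the dataset or filesystem rather than the worker processes. This is only a sketch assuming a standard torchvision ImageFolder layout; the path, batch size and transforms are illustrative, not taken from the repo:

import time
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Illustrative path/transforms; adjust to the actual ImageNet train directory.
dataset = datasets.ImageFolder(
    "/path/to/dataset/ImageNet2012/train",
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]),
)
# num_workers=0 keeps loading in the main process, ruling worker/IPC issues in or out.
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=0)

t0 = time.time()
images, targets = next(iter(loader))
print(images.shape, targets.shape, f"{time.time() - t0:.1f}s for one batch")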
If I switch to --data-backend dali-cpu (using nvidia-dali version 0.16) it fails with the following error:
=> creating model 'mobilenetv2'
Traceback (most recent call last):
  File "imagenet.py", line 403, in <module>
    main()
  File "imagenet.py", line 194, in main
    train_loader, train_loader_len = get_train_loader(args.data, args.batch_size, workers=args.workers, input_size=args.input_size)
TypeError: gdtl() got an unexpected keyword argument 'input_size'
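The TypeError just means imagenet.py passes input_size to the DALI loader factory (gdtl in utils/dataloaders.py) while that function does not declare such a parameter. A hypothetical sketch of the mismatch and one tolerant workaround; the stand-in signature below is illustrative, not the repo's actual one:

# Stand-in for the DALI loader factory; the real gdtl lives in utils/dataloaders.py
# and its parameter list may differ.
def gdtl(data, batch_size, workers=5):
    return None, 0  # placeholder for (train_loader, train_loader_len)

try:
    gdtl("/path/to/ImageNet2012", 256, workers=12, input_size=224)
except TypeError as err:
    print(err)  # gdtl() got an unexpected keyword argument 'input_size'

# One workaround: accept and ignore unknown keyword arguments (or simply drop
# input_size=... at the call site in imagenet.py).
def gdtl(data, batch_size, workers=5, **_ignored):
    return None, 0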
I'm using PyTorch 1.3.1 with 4x Titan Xp cards. The only thing I had to change in your code is to replace cuda(async=True) with cuda(non_blocking=True). Changing to non_blocking=False does not help.
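For reference, the async to non_blocking change is needed because async became a reserved keyword in Python 3.7, so the old call no longer parses. A minimal sketch in plain PyTorch (not repo code):

import torch

x = torch.randn(8, 3, 224, 224)
if torch.cuda.is_available():
    x = x.pin_memory()
    # Old: x = x.cuda(async=True)   -> SyntaxError on Python 3.7+
    x = x.cuda(non_blocking=True)   # drop-in replacement; the asynchronous copy
                                    # only actually overlaps when the source tensor is pinned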
Can you please try cloning your repo to a clean Pytorch 1.3.1 environment and see if you can run it? Any idea what's going on?