Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pin memory issue during training. #98

Open
ajeetkverma opened this issue Jul 14, 2024 · 0 comments
Open

Pin memory issue during training. #98

ajeetkverma opened this issue Jul 14, 2024 · 0 comments

Comments

@ajeetkverma
Copy link

ajeetkverma commented Jul 14, 2024

while training on single gpu, getting pin memory thread exited. it is happening after on round of train/val. I have made it to validate on 10 iters, so that this issue can be captured within time. please suggest the possible reasons for it. error snap is given below.....

2024-07-14 13:56:04,122 - mmedit - INFO - Checkpoints will be saved to /home/ubuntu/Desktop/RealBasicVSR/experiments/realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds by HardDiskBackend.
2024-07-14 13:56:04.586294: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-07-14 13:56:04.636049: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-14 13:56:05.362593: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/torch/nn/functional.py:3103: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details.
warnings.warn("The default behavior for interpolate/upsample with float scale_factor changed "
2024-07-14 13:56:32,710 - mmedit - INFO - Saving checkpoint at 10 iterations
1/1, 0.0 task/s, elapsed: 28s, ETA: 0s2024-07-14 13:57:02,335 - mmedit - INFO - Iter(val) [10] PSNR: 27.7156
2024-07-14 13:57:02,337 - mmedit - INFO - Iter(val) [10]
2024-07-14 13:57:23,808 - mmedit - INFO - Saving checkpoint at 20 iterations
[ ] 0/1, elapsed: 0s, ETA:Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
idx, data = r
ValueError: not enough values to unpack (expected 2, got 0)
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/.mim/tools/train.py", line 171, in
main()
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/.mim/tools/train.py", line 160, in main
train_model(
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/apis/train.py", line 107, in train_model
_non_dist_train(
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/apis/train.py", line 363, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
self.call_hook('after_train_iter')
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/core/evaluation/eval_hooks.py", line 42, in after_train_iter
results = single_gpu_test(
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/apis/test.py", line 41, in single_gpu_test
for data in data_loader:
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 349, in iter
self._iterator._reset(self)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 852, in _reset
data = self._get_data()
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1029, in _get_data
raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/rbvsr/bin/mim", line 8, in
sys.exit(cli())
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mim/commands/train.py", line 100, in cli
is_success, msg = train(
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mim/commands/train.py", line 261, in train
ret = subprocess.check_call(
File "/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/rbvsr/bin/python', '/home/ubuntu/anaconda3/envs/rbvsr/lib/python3.8/site-packages/mmedit/.mim/tools/train.py', 'configs/realbasicvsr_c64b20_1x30x8_lr5e-5_150k_reds.py', '--launcher', 'none', '--gpus', '1']' returned non-zero exit status 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant