Describe the bug
Both CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES affect GPU pinning on HIP systems. Setting both can cause some applications (e.g., PyTorch) to fail to find a GPU:
> CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
> ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
0
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/autofs/nccs-svm1_proj/mat291/lward/mof-generation-at-scale/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
torch._C._cuda_init()
RuntimeError: No HIP GPUs are available
> ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
> ROCR_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
Parsl sets both variables for each worker, which makes GPU pinning fail on HIP systems (e.g., Frontier): with both set to the same nonzero index, the second filter indexes past the end of the already-filtered device list and no GPU remains visible.
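Until this is fixed, one possible workaround is to make sure only one of the two variables reaches the worker process. A minimal sketch (the helper name is hypothetical, not part of Parsl's API):

```python
import os

def pin_single_gpu(device_index: int) -> None:
    """Hypothetical helper: pin this process to one GPU on a ROCm/HIP
    system by setting ROCR_VISIBLE_DEVICES only, and removing
    CUDA_VISIBLE_DEVICES so the two filters cannot compose."""
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    os.environ["ROCR_VISIBLE_DEVICES"] = str(device_index)
```

Calling this early in each worker (before torch initializes HIP) should leave exactly one device visible.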
To Reproduce
TBD; I don't have a minimal example yet. The commands above were run interactively on a Frontier compute node.
Expected behavior
Each worker should find exactly one GPU available.
Environment
Frontier, Parsl as of mid-April 2025
Distributed Environment
Frontier