
Setting both ROCR_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES breaks pinning on HIP devices #3858

Description

@WardLT

Describe the bug

Both CUDA_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES affect GPU pinning on HIP systems.
Setting both can cause some applications (e.g., PyTorch) to fail to find a GPU.

```
> CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
> ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
0
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/autofs/nccs-svm1_proj/mat291/lward/mof-generation-at-scale/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 372, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No HIP GPUs are available
> ROCR_VISIBLE_DEVICES=1 CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
> ROCR_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.device_count()); torch.zeros(4).to('cuda')"
1
```

Parsl sets both variables, which breaks GPU pinning on HIP systems (e.g., Frontier). The failure is likely because the two filters apply sequentially: ROCR_VISIBLE_DEVICES=1 leaves a single visible device, renumbered to index 0, and CUDA_VISIBLE_DEVICES=1 then selects an index that no longer exists.
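
A minimal sketch of a possible workaround, assuming each worker exports exactly one visibility variable. `pin_worker_to_gpu` is a hypothetical helper, not part of Parsl, and detecting HIP via `torch.version.hip` is only one option:

```python
# Hypothetical helper (not part of Parsl): export exactly one visibility
# variable for the assigned GPU instead of setting both.
import os

def pin_worker_to_gpu(gpu_id: str) -> None:
    try:
        import torch
        is_hip = torch.version.hip is not None  # ROCm builds report a HIP version string
    except ImportError:
        is_hip = False  # assumption: default to CUDA when torch is absent

    if is_hip:
        os.environ["ROCR_VISIBLE_DEVICES"] = gpu_id
        # Avoid double filtering: HIP also honors CUDA_VISIBLE_DEVICES.
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id
```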

To Reproduce

TBD. I don't have a minimal example yet; a hypothetical sketch follows below.
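
In the absence of a confirmed reproducer, here is a hedged, untested sketch of what one might look like, assuming Parsl's HighThroughputExecutor with its available_accelerators option (the Frontier provider/launcher configuration is omitted):

```python
# Hedged reproduction sketch (untested): available_accelerators makes Parsl
# pin each worker to one GPU, which on HIP systems exports both
# ROCR_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES for the worker.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

@python_app
def gpu_count():
    import torch
    return torch.cuda.device_count()  # expected: 1 per worker; observed: 0

config = Config(executors=[
    HighThroughputExecutor(
        label="gpu",
        available_accelerators=8,  # assumption: 8 GCDs per Frontier node
        # provider/launcher for Frontier omitted for brevity
    )
])

parsl.load(config)
print(gpu_count().result())
```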

Expected behavior

Each worker should find exactly one GPU available.

Environment
Frontier, Parsl as of mid-April 2025

Distributed Environment
Frontier
