GPUs being detected but "RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found?" HELP!

Hello,

I am trying to recreate the results of this study: GitHub - insitro/ChannelViT: Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words

However, I keep running into the error:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found?

The device I am using is a GPU server with a Titan RTX and a GTX 1080, and according to nvidia-smi it is running CUDA 10.1.

I can’t seem to figure out why the GPUs aren’t being detected. They are detected when I run print(torch.cuda.device_count()), which reports 2 GPUs, and the same goes for print(os.environ["CUDA_VISIBLE_DEVICES"]).
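For reference, these are roughly the checks I am running (a minimal sketch of the checks, not my actual training script):

import os
import torch

# What the processes see; prints whatever is set in my shell, e.g. "0,1"
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# Reports 2 on this server (Titan RTX + GTX 1080)
print("device count:", torch.cuda.device_count())

# The CUDA version the PyTorch binaries were built with (11.7 for my install)
print("torch.version.cuda:", torch.version.cuda)

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))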

I am using PyTorch 2.0.1 with CUDA 11.7; could this be an issue with the mismatched CUDA version installed on the server? I have tried running with only one of the GPUs, but the same error comes up.

I’ve been told that I can use conda to isolate the CUDA version rather than managing it globally. Maybe I’ve done something wrong in the installation, but I can’t seem to figure it out!

Please let me know what I can do!

I forgot to mention that the PyTorch log also shows the GPUs being detected:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Your NVIDIA driver might be too old and you might need to update it in order to run PyTorch with CUDA 11.7.

Yeah, that’s what I assumed. Is there possibly a different solution to this? This is a GPU server managed by my organization and I am not sure if this is something they can update.

Thanks for the quick response though!

If you cannot update the NVIDIA driver, you would need to use an older CUDA toolkit and thus also downgrade PyTorch to a version shipping with CUDA 10, which is quite old by now.

Hi there. Just wondering, if the CUDA version is 11.0 and the driver is 450.119 (I forgot the exact number), would that be able to run PyTorch 2.0?

Yes, if you install the PyTorch binaries with CUDA 11.8, as the minimum NVIDIA driver on Linux would be >=450.80.02. To run the PyTorch binaries with CUDA 12.1U1 you would need to install NVIDIA driver >=525.60.13, as shown in the compatibility matrix.
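If you want to check this programmatically, here is a rough sketch (my own helper, not an official PyTorch utility) that compares the driver version reported by nvidia-smi against those minimum versions:

import subprocess

# Minimum Linux driver versions quoted above (taken from the compatibility matrix)
MIN_DRIVER = {"11.8": (450, 80, 2), "12.1": (525, 60, 13)}

def driver_version():
    # nvidia-smi prints one driver version per GPU, e.g. "450.80.02"
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return tuple(int(p) for p in out.splitlines()[0].strip().split("."))

drv = driver_version()
print("installed driver:", ".".join(map(str, drv)))
for cuda, minimum in MIN_DRIVER.items():
    print(f"CUDA {cuda} binaries usable: {drv >= minimum} "
          f"(needs >= {'.'.join(map(str, minimum))})")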

Hi,

After executing the command to check the availability of the GPUs:

python3 -c "import torch; print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])"

output:

[(0, _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16160MB, multi_processor_count=80))]

But I am still getting the error:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17848) of binary:

Could you post a minimal and executable code snippet reproducing the issue?
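Something as small as the following would already help, i.e. a sketch that only initializes the NCCL process group and runs a single collective (the file name and launch command below are just examples, not taken from the ChannelViT repo):

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
    print("visible GPUs:", torch.cuda.device_count())

    # This is where ProcessGroupNCCL is created; it raises
    # "ProcessGroupNCCL is only supported with GPUs" when the NCCL
    # backend cannot see any CUDA device in this process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # A trivial collective to confirm the process group actually works
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched e.g. with torchrun --nproc_per_node=1 repro.py, this would show whether the error already occurs outside of the ChannelViT code.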