GPUs being detected but "RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found?" HELP!

Hello,

I am trying to recreate the results of this study: GitHub - insitro/ChannelViT: Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words

However, I keep running into the error

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found?

The machine I am using is a GPU server with a Titan RTX and a GTX 1080, and according to nvidia-smi it is running CUDA 10.1.

I can't figure out why the GPUs aren't being detected. print(torch.cuda.device_count()) reports 2 GPUs, and print(os.environ["CUDA_VISIBLE_DEVICES"]) lists both devices as well.

I am using PyTorch 2.0.1 built with CUDA 11.7. Could this be caused by the mismatch with the CUDA version installed on the server? I have also tried running with only one of the GPUs, but the same error comes up.
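For reference, the quick checks I have been running look roughly like this (just a sketch of my own prints, nothing from the ChannelViT repo):

import os
import torch

# Compare the CUDA runtime bundled in the PyTorch wheel with what the driver reports.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))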

I've been told I can use conda to isolate the CUDA version rather than managing it globally. Maybe I've made a mistake in the installation, but I can't figure out what!

Please let me know what I can do!

I forgot to mention that the PyTorch log also shows the GPUs being detected:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Your NVIDIA driver might be too old and you might need to update it in order to run PyTorch with CUDA 11.7.

Yeah, that's what I assumed. Is there a different possible solution to this? This is a GPU server managed by my organization, and I am not sure whether updating the driver is something they can do.

Thanks for the quick response though!

If you cannot update the NVIDIA driver, you would need to use an older CUDA toolkit and thus also downgrade PyTorch to a version shipping with CUDA 10, which is quite old by now.

Hi there. Just wondering: if the CUDA version is 11.0 and the driver is 450.119 (I forgot the exact number), would that be able to run PyTorch 2.0?

Yes, if you install the PyTorch binaries built with CUDA 11.8, since the minimum NVIDIA driver on Linux is >=450.80.02. To run the PyTorch binaries built with CUDA 12.1 Update 1 you would need an NVIDIA driver >=525.60.13, as seen in the compatibility matrix.
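In case it is useful, a rough check you could run on the server itself might look like this (just a sketch; the minimum driver versions are the Linux ones mentioned above, and it assumes nvidia-smi is on the PATH):

import subprocess

# Query the installed driver version and compare it against the minimum
# required for the CUDA runtime shipped with the PyTorch binaries.
MIN_DRIVER = {"11.8": (450, 80, 2), "12.1": (525, 60, 13)}  # Linux minimums from the matrix above

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"], text=True
)
driver = tuple(int(p) for p in out.strip().splitlines()[0].split("."))
for cuda, minimum in MIN_DRIVER.items():
    print(f"CUDA {cuda}: driver {driver} >= {minimum} -> {driver >= minimum}")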

Hi,

After executing the following command to check the availability of the GPU:

python3 -c "import torch; print([(i, torch.cuda.get_device_properties(i)) for i in range(torch.cuda.device_count())])"

output:

[(0, _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16160MB, multi_processor_count=80))]

But I am still getting the error:

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17848) of binary:

Could you post a minimal and executable code snippet reproducing the issue?
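Something along these lines would already be enough as a starting point (a minimal single-process sketch; in a real run torchrun/elastic would set the rank, world size, and master address instead of the hard-coded values used here):

import os
import torch
import torch.distributed as dist

# Minimal single-process NCCL init; torchrun would normally set these env vars.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

print("cuda available:", torch.cuda.is_available(), "device count:", torch.cuda.device_count())

dist.init_process_group(backend="nccl", rank=0, world_size=1)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # trivial collective that exercises ProcessGroupNCCL
print("all_reduce result:", x.item())
dist.destroy_process_group()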

CUDA Version: 11.4 
torch==2.0.0

works

CUDA Version: 11.4 
torch==2.1.0 / torch==2.2.0 / torch==2.3.0

does not work