Torch.cuda.device_count() is 0

When I set the CUDA_VISIBLE_DEVICES environment variable to the UUID of one of my GPUs, torch doesn’t enumerate the device. However, using the device index works fine.

$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-28c5c635-0214-457e-4d24-0282dac9957d)
GPU 1: NVIDIA A30 (UUID: GPU-e031b04a-4f36-83a9-373b-0d6f231a87dc)
$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-28c5c635-0214-457e-4d24-0282dac9957d'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
0
>>> torch._C._cuda_getDeviceCount()
1
$ python
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> 

Interestingly enough, torch.cuda.device_count() isn’t consistent with torch._C._cuda_getDeviceCount() when using the UUID.

The reason this is an issue is that I’m running inference with PyTorch on a MIG-partitioned GPU, and I need to restrict the script to a single MIG slice. As far as I’m aware, UUIDs are the only way to do this. When loading the model, PyTorch raises RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.
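
For what it’s worth, the CUDA runtime clearly still sees the device (torch._C._cuda_getDeviceCount() returns 1), so the UUID itself is valid. Below is a minimal diagnostic sketch to confirm that from the driver side. It assumes the nvidia-ml-py (pynvml) package is installed, which is an assumption and not something used elsewhere in this thread, and it assumes a plain GPU UUID as printed by nvidia-smi -L; MIG instance UUIDs may need different NVML calls.

import os
import pynvml

# CUDA_VISIBLE_DEVICES is assumed to already hold the GPU UUID,
# e.g. 'GPU-28c5c635-0214-457e-4d24-0282dac9957d' from the session above.
uuid = os.environ.get('CUDA_VISIBLE_DEVICES', '')

pynvml.nvmlInit()
try:
    # NVML ignores CUDA_VISIBLE_DEVICES, so this only checks whether
    # the driver itself can resolve the UUID.
    handle = pynvml.nvmlDeviceGetHandleByUUID(uuid.encode())
    print('NVML resolves', uuid, 'to', pynvml.nvmlDeviceGetName(handle))
except pynvml.NVMLError as err:
    print('NVML could not resolve', uuid, ':', err)
finally:
    pynvml.nvmlShutdown()

import torch

# On the affected versions these two disagree when a UUID is used:
print(torch.cuda.device_count())          # 0
print(torch._C._cuda_getDeviceCount())    # 1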

Is this a PyTorch bug? If so, I’ll open a GitHub issue.

Interestingly enough, in Docker (using the --gpus device= flag), I can specify the UUID without running into the issue described above.

Specifying the UUID sounds like the right approach and was working for me in the past with MIG. However, I usually export the environment variable in my terminal, so let me double-check your code snippet, which sets it inside the script.

I’m having the exact same issue (also on a MIG-partitioned GPU). I didn’t have this problem on a previous version of PyTorch, though (I can’t remember which one… I just know it worked in the past).

Do you have any updates or ways to fix it?

No, I haven’t fixed the issue. I can update this thread with the versions I’m using, though. The problem exists on Python 3.8.10 with torch 1.13.0+cu117 on an Ubuntu 20 machine with NVIDIA driver 520.61.05 and CUDA 11.8. The Docker image that allowed me to use the UUID through the --gpus flag was pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, which has Python 3.10.8 and torch 1.13.0.
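
For completeness, the usual way to gather these details for a bug report is python -m torch.utils.collect_env, or from inside the interpreter (this is standard PyTorch tooling, not anything specific to this bug):

from torch.utils import collect_env

# Prints PyTorch, CUDA, cuDNN, driver, and OS details in the format
# expected on GitHub issues.
collect_env.main()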


After struggling with this for the past few days, the only “solution” I have found is to downgrade to an earlier version of PyTorch.

Adding this to my requirements.txt file seems to have fixed the issue.

--extra-index-url https://download.pytorch.org/whl/cu111
torch==1.9.0+cu111
torchvision==0.10.0+cu111

Unfortunately, this is a big step back from the current versions of both packages, but it was the only thing that worked.
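
For anyone trying the same downgrade, a quick way to confirm the UUID is honored again, following the same pattern as the session at the top of the thread (substitute your own UUID from nvidia-smi -L):

import os

# Must be set before torch initializes CUDA; the UUID here is the one
# from the nvidia-smi -L output above -- use your own GPU or MIG slice.
os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-28c5c635-0214-457e-4d24-0282dac9957d'

import torch

print(torch.__version__)           # 1.9.0+cu111 after the downgrade
print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # 1 if the UUID is now honored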

If you haven’t already, could you (@coppock) open a GitHub issue? It seems like a bug in newer PyTorch versions.

Very good. It looks like there’s already an issue that has been open for two weeks. I should’ve checked earlier!
