When I set the CUDA_VISIBLE_DEVICES environment variable to the UUID of one of my GPUs, PyTorch doesn't enumerate the device. Using the device index instead works fine.
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-28c5c635-0214-457e-4d24-0282dac9957d)
GPU 1: NVIDIA A30 (UUID: GPU-e031b04a-4f36-83a9-373b-0d6f231a87dc)
$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-28c5c635-0214-457e-4d24-0282dac9957d'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
0
>>> torch._C._cuda_getDeviceCount()
1
$ python
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>>
Interestingly enough, torch.cuda.device_count() isn’t consistent with torch._C._cuda_getDeviceCount() when using the UUID.
The reason this is an issue is that I'm running inference with PyTorch on a MIG-partitioned GPU, and I need to give the script a single MIG slice. As far as I am aware, UUIDs are the only way to do this. When loading the model, PyTorch raises: RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.
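For anyone hitting the same RuntimeError: the message itself points at the usual workaround, which is to load the checkpoint onto the CPU with map_location and move it to a device afterwards. A minimal, self-contained sketch (checkpoint.pt is a placeholder filename, not from the thread):

```python
import torch

# Save a small state dict so the example is self-contained.
state = {"weight": torch.randn(2, 2)}
torch.save(state, "checkpoint.pt")

# map_location="cpu" remaps storages that were saved on a CUDA device,
# avoiding "Attempting to deserialize object on CUDA device 0" when no
# CUDA device is enumerated. Move tensors to the GPU afterwards with
# .to("cuda") once the device is actually visible.
loaded = torch.load("checkpoint.pt", map_location="cpu")
print(loaded["weight"].device)  # cpu
```

This only sidesteps the loading error, of course; it doesn't fix the underlying device_count() discrepancy.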
Is this a PyTorch bug? If so, I'll open a GitHub issue.
Specifying the UUID sounds like the right approach and was working for me in the past with MIG. However, I usually export the env variable in my terminal, so let me double-check your code snippet, which sets it inside the script.
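For comparison, this is what I mean by exporting it in the terminal; a sketch using the first UUID from the nvidia-smi -L output above (substitute your own UUID or MIG slice):

```shell
# Export the UUID in the shell before launching Python, instead of
# setting os.environ inside the script after interpreter start-up.
export CUDA_VISIBLE_DEVICES=GPU-28c5c635-0214-457e-4d24-0282dac9957d

# The variable is inherited by the child process:
python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
```

Either way the variable is set before torch is imported, so in principle both should behave the same.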
I'm having the exact same issue (also on a MIG-partitioned GPU). I didn't have this issue on a previous version of PyTorch, though (I can't remember which one; I just know that it worked in the past).
No, I haven't fixed the issue. I can update this thread with the versions I'm using, though. The problem exists on Python 3.8.10 with torch 1.13.0+cu117, on an Ubuntu 20 machine with NVIDIA driver 520.61.05 and CUDA 11.8. The Docker image that allowed me to use the UUID through the --gpus flag was pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, which has Python 3.10.8 and torch 1.13.0.
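For reference, passing the UUID through Docker's --gpus flag looks roughly like this (a sketch, not tested here; the image tag is the one mentioned above, and the UUID is from the nvidia-smi -L output at the top of the thread — substitute your own GPU or MIG device UUID):

```shell
# Expose a single GPU (or MIG slice) to the container by UUID; inside the
# container, only that device is enumerated, so index 0 refers to it.
docker run --rm \
  --gpus '"device=GPU-28c5c635-0214-457e-4d24-0282dac9957d"' \
  pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
  nvidia-smi -L
```

This requires the NVIDIA Container Toolkit on the host, and the quoting around device=… matters when listing more than one device.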