When I set the CUDA_VISIBLE_DEVICES environment variable to the UUID of one of my GPUs, PyTorch doesn't enumerate the device. Using the device index instead works fine.
$ nvidia-smi -L
GPU 0: NVIDIA A30 (UUID: GPU-28c5c635-0214-457e-4d24-0282dac9957d)
GPU 1: NVIDIA A30 (UUID: GPU-e031b04a-4f36-83a9-373b-0d6f231a87dc)
$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = 'GPU-28c5c635-0214-457e-4d24-0282dac9957d'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
0
>>> torch._C._cuda_getDeviceCount()
1
$ python
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0'
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>>
Interestingly enough, torch.cuda.device_count() isn’t consistent with torch._C._cuda_getDeviceCount() when using the UUID.
The reason this is an issue is that I'm running inference with PyTorch on a MIG-partitioned GPU, and I need to give the script a single MIG slice. As far as I am aware, UUIDs are the only way to do this. When loading the model, PyTorch raises: RuntimeError: Attempting to deserialize object on CUDA device 0 but torch.cuda.device_count() is 0. Please use torch.load with map_location to map your storages to an existing device.
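For anyone hitting the same RuntimeError: the message itself points at the usual workaround, which is to load the checkpoint onto the CPU with map_location and move it to a device afterwards. A minimal, self-contained sketch (checkpoint.pt is a placeholder filename, not from the thread):

```python
import torch

# Save a small state dict so the example is self-contained.
state = {"weight": torch.randn(2, 2)}
torch.save(state, "checkpoint.pt")

# map_location="cpu" remaps storages that were saved on a CUDA device,
# avoiding "Attempting to deserialize object on CUDA device 0" when no
# CUDA device is enumerated. Move tensors to the GPU afterwards with
# .to("cuda") once the device is actually visible.
loaded = torch.load("checkpoint.pt", map_location="cpu")
print(loaded["weight"].device)  # cpu
```

This only sidesteps the loading error, of course; it doesn't fix the underlying device_count() discrepancy.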
Is this a PyTorch bug? If so, I'll open a GitHub issue.
Specifying the UUID sounds like the right approach and was working for me in the past with MIG. However, I usually export the env variable in my terminal, so let me double-check your code snippet, which sets it inside the script.
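For comparison, this is what I mean by exporting it in the terminal; a sketch using the first UUID from the nvidia-smi -L output above (substitute your own UUID or MIG slice):

```shell
# Export the UUID in the shell before launching Python, instead of
# setting os.environ inside the script after interpreter start-up.
export CUDA_VISIBLE_DEVICES=GPU-28c5c635-0214-457e-4d24-0282dac9957d

# The variable is inherited by the child process:
python3 -c 'import os; print(os.environ["CUDA_VISIBLE_DEVICES"])'
```

Either way the variable is set before torch is imported, so in principle both should behave the same.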
I'm having the exact same issue (also on a MIG-partitioned GPU). I didn't have this issue on a previous version of PyTorch, though (I can't remember which one; I just know that it worked in the past).
No, I haven't fixed the issue. I can update this thread with the versions I'm using, though. The problem exists on Python 3.8.10 with torch 1.13.0+cu117, on an Ubuntu 20 machine with NVIDIA driver 520.61.05 and CUDA 11.8. The Docker image that allowed me to use the UUID through the --gpus flag was pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime, which has Python 3.10.8 and torch 1.13.0.
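For reference, passing the UUID through Docker's --gpus flag looks roughly like this (a sketch, not tested here; the image tag is the one mentioned above, and the UUID is from the nvidia-smi -L output at the top of the thread — substitute your own GPU or MIG device UUID):

```shell
# Expose a single GPU (or MIG slice) to the container by UUID; inside the
# container, only that device is enumerated, so index 0 refers to it.
docker run --rm \
  --gpus '"device=GPU-28c5c635-0214-457e-4d24-0282dac9957d"' \
  pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime \
  nvidia-smi -L
```

This requires the NVIDIA Container Toolkit on the host, and the quoting around device=… matters when listing more than one device.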