@ptrblck, can I get help with a similar issue? I have a K80 GPU machine with NVIDIA driver 470.82.01, CUDA 11.8, and PyTorch `2.0.1+cu117`. `torch.cuda.is_available()` returns `True`, and `torch.cuda.device_count()` returns 1. However, `torch.zeros(1, device="cuda")` raises `RuntimeError: No CUDA GPUs are available`, as does `torch.cuda.get_device_name()`; `!python -m torch.utils.collect_env` also fails with the same error.
PyTorch training with the GPU works if I install CUDA 10.2 on the machine. However, the K80 has compute capability 3.7, which should be supported by the installed driver and CUDA 11.8. `torch.cuda.get_arch_list()` returns `['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']`, so `sm_37` is supported, correct? Why wouldn't it work? Is there any way to enable a more up-to-date driver? Thank you!
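For reference, here is a minimal script collecting the checks above in one place (my assumption, based on PyTorch's lazy CUDA initialization: `is_available()` and `device_count()` only query the driver, while the `RuntimeError` surfaces once a CUDA context is actually created, e.g. by allocating a tensor):

```python
import torch

# These succeed on my machine: the driver sees the K80.
print(torch.__version__)          # 2.0.1+cu117
print(torch.cuda.is_available())  # True
print(torch.cuda.device_count())  # 1
print(torch.cuda.get_arch_list()) # includes 'sm_37'

# Allocating a tensor forces CUDA context creation, which is
# where the failure actually happens for me.
try:
    t = torch.zeros(1, device="cuda")
    print(t)
except RuntimeError as e:
    print("CUDA init failed:", e)  # No CUDA GPUs are available
```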
`!nvidia-smi` shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000003:00:00.0 Off | 0 |
| N/A 32C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+