PyTorch unable to identify the GPU and CUDA

Trying with Stable build of PyTorch with CUDA 11.3 & 11.6

I’m using my university’s HPC cluster to run my work, and it worked fine previously. This time, however, PyTorch cannot detect the GPUs, even though nvidia-smi shows one of the GPUs as idle.

Using nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Using nvidia-smi (the table below took longer than usual to load):

Sun Sep 25 05:08:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   30C    P8    32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  ERR!                On   | 00000000:E3:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 46068MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Both GPUs should be A40s, but one of the cards just shows ERR!, which is odd as this has never happened before. I tried installing the Stable build (1.12.1) of PyTorch with CUDA 11.6, with no luck. I then reinstalled the CUDA 11.3 build, which previously worked well, with the same result. (I’m not using the vision and audio modules, so I didn’t revert those two.)

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113

When I run the following code, it returns an empty list with a warning:

import torch

# Build one torch.cuda.device handle per GPU that PyTorch can see.
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
print(available_gpus)

# Output:
# /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
#   return torch._C._cuda_getDeviceCount() > 0
# []  (empty list)

The code below also returns cpu:

# NOTE: `cuda` is not defined in this snippet; it is presumably a boolean flag set elsewhere.
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")

print(device)
# cpu

Running collect_env (the main() function, imported via from torch.utils.collect_env import main) gives the following output:

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: Could not collect
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas                      1.0                         mkl  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-fft                   1.3.1                    pypi_0    pypi
[conda] mkl-random                1.2.2                    pypi_0    pypi
[conda] mkl-service               2.4.0                    pypi_0    pypi
[conda] mkl_fft                   1.3.1           py310h2b4bcf5_1    conda-forge
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.21.0                   pypi_0    pypi
[conda] numpy-base                1.22.3          py310h9585f30_0  
[conda] pytorch                   1.10.2          cpu_py310h6894f24_0  
[conda] pytorch-forecasting       0.9.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-tabnet            3.0.0                    pypi_0    pypi
[conda] pytorch-tabular           0.7.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.2.1                    pypi_0    pypi
[conda] torch                     1.12.1+cu113             pypi_0    pypi
[conda] torchaudio                0.12.1+cu116             pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchtext                 0.13.1                   pypi_0    pypi
[conda] torchvision               0.13.1+cu116             pypi_0    pypi
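The listing above shows torch tagged +cu113 next to torchvision and torchaudio tagged +cu116, plus a CPU-only conda pytorch 1.10.2. Whether that mix contributes to the failure is not established in this thread, but it is easy to spot mechanically. The sketch below is a minimal example, assuming the collect_env text has been saved to a string; the function name cuda_tags and the SAMPLE excerpt are illustrative and not part of PyTorch.

```python
import re
from collections import defaultdict

# Only the three core packages are checked; their wheels carry a "+cuXXX" tag.
TAGGED = re.compile(r"^\[pip3\]\s+(torch|torchvision|torchaudio)==(\S+)$")


def cuda_tags(collect_env_text):
    """Map each CUDA tag (e.g. "cu113") to the packages built with it."""
    tags = defaultdict(list)
    for line in collect_env_text.splitlines():
        match = TAGGED.match(line.strip())
        if match:
            name, version = match.groups()
            # Versions without a "+..." suffix are grouped as "untagged".
            tag = version.split("+", 1)[1] if "+" in version else "untagged"
            tags[tag].append(name)
    return dict(tags)


# Excerpt of the [pip3] lines from the collect_env output above.
SAMPLE = """\
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu116
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
"""

tags = cuda_tags(SAMPLE)
if len(tags) > 1:
    print("Mixed CUDA builds:", tags)
```

More than one key in the result means the packages were installed from different wheel indexes, which is worth cleaning up before debugging further.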

Trying with Preview (Nightly) build of PyTorch with CUDA 11.7

I have also tried installing the Preview (Nightly) build of PyTorch with CUDA 11.7, but it doesn’t seem to work either.

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu117

I terminated the session and queued for a single GPU this time. The nvidia-smi output:

Sun Sep 25 05:46:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   30C    P8    32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running collect_env:

Collecting environment information...
/scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.13.0.dev20220924+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.13.0.dev20220924+cu117
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas                      1.0                         mkl  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-fft                   1.3.1                    pypi_0    pypi
[conda] mkl-random                1.2.2                    pypi_0    pypi
[conda] mkl-service               2.4.0                    pypi_0    pypi
[conda] mkl_fft                   1.3.1           py310h2b4bcf5_1    conda-forge
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.21.0                   pypi_0    pypi
[conda] numpy-base                1.22.3          py310h9585f30_0  
[conda] pytorch                   1.10.2          cpu_py310h6894f24_0  
[conda] pytorch-forecasting       0.9.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-tabnet            3.0.0                    pypi_0    pypi
[conda] pytorch-tabular           0.7.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.2.1                    pypi_0    pypi
[conda] torch                     1.13.0.dev20220924+cu117          pypi_0    pypi
[conda] torchaudio                0.12.1+cu116             pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchtext                 0.13.1                   pypi_0    pypi
[conda] torchvision               0.13.1+cu116             pypi_0    pypi

From the output, it now seems to recognize the GPU, and torch.cuda.device_count() returns 1. However, the warning /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.) remains, and the following calls also contradict my assumption that the GPU is usable:

device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
# device: 'cpu'

torch.cuda.is_available()
# False

torch.cuda.get_device_name(0)
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.

torch.cuda.current_device()
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.
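The individual checks above can be gathered into one report that does not raise, which is convenient when pasting results into a thread like this one. This is only a sketch: cuda_report is a hypothetical helper name, and it uses torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.device_count(), which return values rather than raising when no usable GPU is found.

```python
def cuda_report():
    """Collect the CUDA-related facts PyTorch can report without raising."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the binary was built with; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "is_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

A report showing a CUDA build together with is_available: False matches the situation described here: the binary is fine, but the driver could not be initialized.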

Your time and effort are very much appreciated.

Try to fix this issue first, e.g. by reinstalling the drivers, before running any PyTorch workloads, as your setup seems to be broken.
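As a pre-flight check before launching a job, the default nvidia-smi table can be scanned for GPUs that report ERR!, as GPU 1 does in the first table of this thread. The following is a minimal sketch, assuming the table layout shown above; gpus_reporting_errors is a hypothetical helper, and other driver versions may format the table differently.

```python
import re

# Matches the first row of each GPU entry in the default nvidia-smi table,
# e.g. "|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off | ... |".
GPU_ROW = re.compile(r"^\|\s+(\d+)\s+(.+?)\s+(?:On|Off)\s+\|")


def gpus_reporting_errors(table_text):
    """Return the indices of GPUs whose first table row contains "ERR!"."""
    broken = []
    for line in table_text.splitlines():
        match = GPU_ROW.match(line)
        if match and "ERR!" in line:
            broken.append(int(match.group(1)))
    return broken


# Rows copied from the first nvidia-smi output in this thread.
SAMPLE = """\
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   30C    P8    32W / 300W |      0MiB / 46068MiB |      0%      Default |
|   1  ERR!                On   | 00000000:E3:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 46068MiB |    ERR!      Default |
"""

print(gpus_reporting_errors(SAMPLE))  # prints [1]
```

On a real node, the text would come from running nvidia-smi, for example via subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout. A non-empty result is a reason to contact the cluster administrators rather than to reinstall PyTorch.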

@ptrblck thanks for the input, I will pass that on to the team.

Hi Team, I have a similar problem. I am using 2x A100 GPUs, and both GPUs are healthy (attaching nvidia-smi and nvcc --version results).

But when I check the available CUDA devices through PyTorch, none are detected.

You can see the following error message in the screenshot:

CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1711403380481/work/c10/cuda/CUDAFunctions.cpp:108.)

Your support and guidance are much appreciated.

I don’t know how you’ve installed the drivers, but the nvcc --version output shows CUDA 10.1, which is incompatible with your Ampere GPUs. If you’ve installed CUDA 10.1 including the drivers for some reason, uninstall them and reinstall either a current driver only or the latest CUDA toolkit including the driver.
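To make the version check concrete, the release field of nvcc --version can be parsed and compared against the first toolkit release that supports Ampere. The sketch below assumes 11.0 as that minimum, which covers the A100 (compute capability 8.0); the A40 (compute capability 8.6) needs 11.1 for native compilation. The function names are illustrative. Note that PyTorch pip and conda binaries ship their own CUDA runtime, so this only describes the locally installed toolkit, not what PyTorch itself uses.

```python
import re

# Assumption: CUDA 11.0 is the first toolkit release with Ampere (A100) support.
MIN_AMPERE_TOOLKIT = (11, 0)


def nvcc_release(nvcc_output):
    """Extract (major, minor) from the "release X.Y" field of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def toolkit_supports_ampere(nvcc_output):
    """True if the reported toolkit release is at least MIN_AMPERE_TOOLKIT."""
    release = nvcc_release(nvcc_output)
    return release is not None and release >= MIN_AMPERE_TOOLKIT


# Line taken from the nvcc output in the first post of this thread.
print(toolkit_supports_ampere("Cuda compilation tools, release 11.0, V11.0.221"))  # True
# Hypothetical stand-in for the 10.1 output mentioned in the reply above.
print(toolkit_supports_ampere("Cuda compilation tools, release 10.1"))  # False
```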

Thank you @ptrblck, I appreciate the quick response. It worked after updating the CUDA Toolkit.

However, the CUDA compiler version is still 10.1.

Good to hear it’s working! Based on the nvcc output, I would guess you’ve installed multiple CUDA toolkits and that the right driver is now being used.
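One way to check the guess about multiple toolkits is to list every nvcc that is reachable. This is a rough sketch, assuming toolkits live either on PATH or under /usr/local/cuda* (a common default on Linux, but HPC clusters often use environment modules and other prefixes); find_nvcc_candidates is a hypothetical helper.

```python
import glob
import os


def find_nvcc_candidates(path_env=None, roots=("/usr/local/cuda*",)):
    """List nvcc executables found on PATH and under the given root patterns."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    candidates = []
    # First, every PATH directory in lookup order (the first hit is what `nvcc` runs).
    for directory in path_env.split(os.pathsep):
        candidate = os.path.join(directory, "nvcc")
        if directory and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            candidates.append(os.path.realpath(candidate))
    # Then, conventional install prefixes that may not be on PATH.
    for pattern in roots:
        for root in sorted(glob.glob(pattern)):
            candidate = os.path.join(root, "bin", "nvcc")
            if os.path.isfile(candidate):
                candidates.append(os.path.realpath(candidate))
    # Drop duplicates (symlinks resolve to one path) while keeping discovery order.
    return list(dict.fromkeys(candidates))


for path in find_nvcc_candidates():
    print(path)
```

If more than one path is printed, the first PATH entry decides what nvcc --version reports, which would explain a stale 10.1 compiler next to a newer, working driver.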