PyTorch unable to identify the GPU and CUDA

Angus_Tay · September 24, 2022, 8:06pm

Trying with Stable build of PyTorch with CUDA 11.3 & 11.6

I’m using my university’s HPC cluster for my work, and it worked fine previously. This time, however, PyTorch cannot detect the GPUs, even though nvidia-smi shows one of them sitting idle.

Using nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Using nvidia-smi (the table below took longer than usual to load):

Sun Sep 25 05:08:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   30C    P8    32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  ERR!                On   | 00000000:E3:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 46068MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Both GPUs should be A40s, but one of the cards just shows ERR!, which is odd as this has never happened before. I tried installing the Stable build (1.12.1) of PyTorch with CUDA 11.6, with no luck. So I reinstalled the build with CUDA 11.3, which previously worked well, and got the same result. (I’m not using the vision and audio modules, so I didn’t revert those two.)

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113
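The ERR! row in the nvidia-smi table above can also be picked out programmatically, which may be handy when reporting the fault to cluster admins. This is only a minimal sketch: it assumes the default table layout shown in this post, and the function name find_faulty_gpus and the regular expression are my own, not part of any NVIDIA tool.

```python
import re

# Matches the first line of each GPU entry in the default nvidia-smi table,
# e.g. "|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off | ..."
# (layout assumed from the table in this thread; other versions may differ).
GPU_ROW = re.compile(
    r"^\|\s+(?P<index>\d+)\s+(?P<name>.+?)\s+(?:On|Off)\s+\|"
    r"\s+(?P<bus>[0-9A-Fa-f:.]+)\s"
)

def find_faulty_gpus(smi_table):
    """Return index and bus ID of every GPU row that contains 'ERR!'."""
    faulty = []
    for line in smi_table.splitlines():
        match = GPU_ROW.match(line)
        if match and "ERR!" in line:
            faulty.append({"index": int(match["index"]), "bus": match["bus"]})
    return faulty

# The two GPU rows copied from the table above.
SAMPLE = """\
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|   1  ERR!                On   | 00000000:E3:00.0 Off |                 ERR! |
"""

print(find_faulty_gpus(SAMPLE))
```

In practice the input would be the captured text of nvidia-smi itself rather than a pasted sample.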

When I run the following code, it returns an empty list with a warning:

import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]

# /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
#   return torch._C._cuda_getDeviceCount() > 0

# available_gpus: []  (empty list)

The code below returns cpu:

cuda = True  # user flag requesting GPU use (assumed; not defined in the posted snippet)
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")

print(device)
# cpu

Running collect_env (the main() function, imported via from torch.utils.collect_env import main) gives the following output:

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: Could not collect
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas                      1.0                         mkl  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-fft                   1.3.1                    pypi_0    pypi
[conda] mkl-random                1.2.2                    pypi_0    pypi
[conda] mkl-service               2.4.0                    pypi_0    pypi
[conda] mkl_fft                   1.3.1           py310h2b4bcf5_1    conda-forge
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.21.0                   pypi_0    pypi
[conda] numpy-base                1.22.3          py310h9585f30_0  
[conda] pytorch                   1.10.2          cpu_py310h6894f24_0  
[conda] pytorch-forecasting       0.9.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-tabnet            3.0.0                    pypi_0    pypi
[conda] pytorch-tabular           0.7.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.2.1                    pypi_0    pypi
[conda] torch                     1.12.1+cu113             pypi_0    pypi
[conda] torchaudio                0.12.1+cu116             pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchtext                 0.13.1                   pypi_0    pypi
[conda] torchvision               0.13.1+cu116             pypi_0    pypi
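The listing above mixes CUDA build tags: torch is a cu113 wheel while torchvision and torchaudio are cu116 wheels. That is unlikely to explain a driver initialization failure by itself, but it is cheap to rule out. A minimal sketch for spotting it in collect_env output follows; the helper name cuda_tags and the regular expression are hypothetical, and it only looks at the [pip3] lines.

```python
import re

# Matches pip-style lines such as "[pip3] torch==1.12.1+cu113".
PIP_LINE = re.compile(r"^\[pip3?\]\s+(?P<pkg>[\w.-]+)==(?P<ver>\S+)$")

def cuda_tags(collect_env_text):
    """Map package name -> CUDA build tag (e.g. 'cu113') for tagged wheels."""
    tags = {}
    for line in collect_env_text.splitlines():
        match = PIP_LINE.match(line.strip())
        if match and "+cu" in match["ver"]:
            tags[match["pkg"]] = match["ver"].split("+", 1)[1]
    return tags

# Lines copied from the collect_env output above.
sample = """\
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu116
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
"""

tags = cuda_tags(sample)
mixed = len(set(tags.values())) > 1  # more than one distinct tag means mixed builds
print(tags, "mixed builds:", mixed)
```

Untagged packages such as torchtext are ignored, since their version string does not say which CUDA build they target.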

Trying with Preview (Nightly) build of PyTorch with CUDA 11.7

I have also tried installing the Preview (Nightly) build of PyTorch with CUDA 11.7, but it doesn’t seem to work either.

pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu117

I terminated the session and queued for a single GPU this time. nvidia-smi now shows:

Sun Sep 25 05:46:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   30C    P8    32W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running collect_env:

Collecting environment information...
/scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.13.0.dev20220924+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.13.0.dev20220924+cu117
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas                      1.0                         mkl  
[conda] efficientnet-pytorch      0.6.3                    pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-fft                   1.3.1                    pypi_0    pypi
[conda] mkl-random                1.2.2                    pypi_0    pypi
[conda] mkl-service               2.4.0                    pypi_0    pypi
[conda] mkl_fft                   1.3.1           py310h2b4bcf5_1    conda-forge
[conda] mkl_random                1.2.2           py310h00e6091_0  
[conda] numpy                     1.21.0                   pypi_0    pypi
[conda] numpy-base                1.22.3          py310h9585f30_0  
[conda] pytorch                   1.10.2          cpu_py310h6894f24_0  
[conda] pytorch-forecasting       0.9.0                    pypi_0    pypi
[conda] pytorch-lightning         1.6.5                    pypi_0    pypi
[conda] pytorch-tabnet            3.0.0                    pypi_0    pypi
[conda] pytorch-tabular           0.7.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.2.1                    pypi_0    pypi
[conda] torch                     1.13.0.dev20220924+cu117          pypi_0    pypi
[conda] torchaudio                0.12.1+cu116             pypi_0    pypi
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchtext                 0.13.1                   pypi_0    pypi
[conda] torchvision               0.13.1+cu116             pypi_0    pypi

From the output, PyTorch now seems to recognize the GPU, and torch.cuda.device_count() returns 1. However, the warning /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.) remains, and the following calls contradict my assumption that the GPU is usable:

device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
# device: 'cpu'

torch.cuda.is_available()
# False

torch.cuda.get_device_name(0)
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.

torch.cuda.current_device()
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.
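The individual checks above can be gathered into one small report that is easy to paste into a support ticket. This is a sketch rather than an official diagnostic: the function name cuda_report is hypothetical, and reading CUDA_VISIBLE_DEVICES is an assumption on my part (job schedulers on shared clusters often set it, and an empty value hides all GPUs from CUDA).

```python
import os

def cuda_report():
    """Collect a few facts relevant to 'CUDA driver initialization failed'."""
    # None means the variable is unset; an empty string hides every GPU.
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["is_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report

for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

The report deliberately avoids torch.cuda.get_device_name() and torch.cuda.current_device(), since those raise a RuntimeError when driver initialization fails, as shown above.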

Your time and effort are very much appreciated.

ptrblck · September 24, 2022, 8:11pm

Try to fix this issue first, e.g. by reinstalling the drivers, before running any PyTorch workloads, as it seems your setup is broken.

Angus_Tay · September 25, 2022, 4:51am

@ptrblck thanks for the input, I will relay that to the team.

Denesh_Kumar_Mani · April 10, 2024, 2:59pm

Hi Team, I have a similar problem. I am using 2x A100 GPUs, and both GPUs are healthy (attaching nvidia-smi and nvcc --version results).

But when I check the available CUDA devices through PyTorch, they are not detected.

You can see the following error message in the screenshot.

CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/conda/conda-bld/pytorch_1711403380481/work/c10/cuda/CUDAFunctions.cpp:108.)

Your support and guidance are much appreciated.

ptrblck · April 10, 2024, 3:30pm

I don’t know how you’ve installed the drivers, but the nvcc --version output shows CUDA 10.1, which is incompatible with your Ampere GPUs. If you’ve installed CUDA 10.1 including the drivers for some reason, uninstall them and reinstall either a current driver only or the latest CUDA toolkit including the driver.
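The version check described here can be sketched in a few lines. The sketch assumes that Ampere (compute capability 8.x) support first appeared in CUDA 11.0, which is where the (11, 0) threshold comes from. The helper names are hypothetical, and the 10.1 sample line is illustrative only, since the screenshot from the thread is not reproduced here.

```python
import re

# Assumption: the first CUDA toolkit release able to target Ampere GPUs is 11.0.
MIN_CUDA_FOR_AMPERE = (11, 0)

def nvcc_release(nvcc_output):
    """Extract (major, minor) from 'nvcc --version' text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None

def supports_ampere(nvcc_output):
    """True if the reported toolkit release is at least MIN_CUDA_FOR_AMPERE."""
    release = nvcc_release(nvcc_output)
    return release is not None and release >= MIN_CUDA_FOR_AMPERE

sample_new = "Cuda compilation tools, release 11.0, V11.0.221"  # from the first post
sample_old = "Cuda compilation tools, release 10.1"             # illustrative line

print(nvcc_release(sample_new), supports_ampere(sample_new))
print(nvcc_release(sample_old), supports_ampere(sample_old))
```

Note that nvcc reports the toolkit found on PATH, not the driver, so this check says nothing about whether the installed driver is healthy.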

Denesh_Kumar_Mani · April 11, 2024, 5:49pm

Thank you, @ptrblck. I appreciate the quick response. It worked after updating the CUDA Toolkit.

However, the CUDA compiler version is still 10.1.

ptrblck · April 11, 2024, 7:09pm

Good to hear it’s working! Based on the nvcc output I would guess you’ve installed multiple CUDA toolkits and the right driver is now used.
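One way to test that guess is to list the toolkit directories on the machine and see which nvcc the shell resolves first. This is a rough sketch that assumes the default Linux install location /usr/local/cuda-X.Y; clusters that use environment modules may keep toolkits elsewhere, and the function name installed_toolkits is hypothetical.

```python
import glob
import os
import shutil

def installed_toolkits(root="/usr/local"):
    """Return directories under root whose names start with 'cuda'."""
    candidates = glob.glob(os.path.join(root, "cuda*"))
    return sorted(path for path in candidates if os.path.isdir(path))

# Several entries here would suggest multiple toolkits side by side.
print("toolkits found:", installed_toolkits())
# The nvcc that 'nvcc --version' actually runs (None if not on PATH).
print("nvcc on PATH:  ", shutil.which("nvcc"))
```

If an old nvcc comes first on PATH, nvcc --version keeps reporting the old release even though a newer toolkit and a working driver are installed.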