Trying with Stable build of PyTorch with CUDA 11.3 & 11.6
I’m using my university’s HPC cluster to run my work, and it worked fine previously. This time, however, PyTorch cannot detect any GPUs, even though nvidia-smi shows one of the GPUs sitting idle.
Output of nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
Output of nvidia-smi (the table below took longer than usual to load):
Sun Sep 25 05:08:11 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:CA:00.0 Off | 0 |
| 0% 30C P8 32W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 ERR! On | 00000000:E3:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Both GPUs should be A40s, but one of the cards just shows ERR!, which is odd, as this has never happened before. I tried installing the Stable (1.12.1) build of PyTorch with CUDA 11.6, with no luck. I then reinstalled the CUDA 11.3 build, which previously worked well, with the same result. (I’m not using the vision and audio modules, so I didn’t revert those two.)
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113
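The ERR! row described above can also be spotted programmatically, which may be handy when reporting the node to the cluster admins. The following is only a minimal sketch: it assumes the default nvidia-smi table layout shown earlier, and `GPU_ROW` and `find_gpus` are hypothetical names introduced here for illustration.

```python
import re

# Matches the first row of each GPU entry in the default nvidia-smi table,
# e.g. "| 1 ERR! On | 00000000:E3:00.0 Off | ERR! |".
# Assumption: the row holds index, name, persistence mode, then the bus ID.
GPU_ROW = re.compile(
    r"^\|\s*(?P<index>\d+)\s+(?P<name>.+?)\s+(?:On|Off)\s*\|"
    r"\s*(?P<bus_id>[0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def find_gpus(smi_text):
    """Return (index, name, bus_id, is_faulty) for each GPU row found."""
    gpus = []
    for line in smi_text.splitlines():
        match = GPU_ROW.match(line.strip())
        if match:
            name = match.group("name")
            # A card whose name field reads ERR! is treated as faulty.
            gpus.append((int(match.group("index")), name,
                         match.group("bus_id"), "ERR!" in name))
    return gpus


# Rows copied from the nvidia-smi output above.
sample = """\
| 0 NVIDIA A40 On | 00000000:CA:00.0 Off | 0 |
| 0% 30C P8 32W / 300W | 0MiB / 46068MiB | 0% Default |
| 1 ERR! On | 00000000:E3:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 46068MiB | ERR! Default |
"""

for gpu in find_gpus(sample):
    print(gpu)
# (0, 'NVIDIA A40', '00000000:CA:00.0', False)
# (1, 'ERR!', '00000000:E3:00.0', True)
```

In practice the text could come from running nvidia-smi through `subprocess.run`; the pasted sample is used here so the sketch runs on a machine without a GPU.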
So what happens is, when I run the following code, it returns an empty list with a warning.
import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
# /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
#   return torch._C._cuda_getDeviceCount() > 0
# available_gpus is [] (an empty list)
Likewise, the code below gives cpu in return.
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
print(device)
# cpu
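When the device silently falls back to cpu like this, it can help to gather the relevant facts in one place. The sketch below is an illustration only: `cuda_report` is a hypothetical helper, and checking `CUDA_VISIBLE_DEVICES` assumes the cluster's job scheduler may set that variable, which is not confirmed for this HPC.

```python
import os


def cuda_report():
    """Collect basic facts relevant to a failed CUDA initialization."""
    report = {
        # Often set by job schedulers; an empty string hides every GPU.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_installed": False,
    }
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None for CPU-only builds, otherwise a string such as "11.6".
    report["built_with_cuda"] = torch.version.cuda
    report["is_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` is None, the interpreter is picking up a CPU-only build; if it is set but `is_available` is False, the problem is more likely on the driver or hardware side.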
Running collect_env (that is, main() imported via from torch.utils.collect_env import main), I get the following output:
Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: Could not collect
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas 1.0 mkl
[conda] efficientnet-pytorch 0.6.3 pypi_0 pypi
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-fft 1.3.1 pypi_0 pypi
[conda] mkl-random 1.2.2 pypi_0 pypi
[conda] mkl-service 2.4.0 pypi_0 pypi
[conda] mkl_fft 1.3.1 py310h2b4bcf5_1 conda-forge
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.21.0 pypi_0 pypi
[conda] numpy-base 1.22.3 py310h9585f30_0
[conda] pytorch 1.10.2 cpu_py310h6894f24_0
[conda] pytorch-forecasting 0.9.0 pypi_0 pypi
[conda] pytorch-lightning 1.6.5 pypi_0 pypi
[conda] pytorch-tabnet 3.0.0 pypi_0 pypi
[conda] pytorch-tabular 0.7.0 pypi_0 pypi
[conda] segmentation-models-pytorch 0.2.1 pypi_0 pypi
[conda] torch 1.12.1+cu113 pypi_0 pypi
[conda] torchaudio 0.12.1+cu116 pypi_0 pypi
[conda] torchmetrics 0.7.3 pypi_0 pypi
[conda] torchtext 0.13.1 pypi_0 pypi
[conda] torchvision 0.13.1+cu116 pypi_0 pypi
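The listing above shows torch tagged +cu113 next to torchaudio and torchvision tagged +cu116. Whether that mix contributes to the failure is not established here, but it is easy to check for. A minimal sketch, assuming pip-style `name==version+tag` lines; `cuda_tags` is a hypothetical helper name:

```python
import re


def cuda_tags(pip_lines):
    """Map package name to its local build tag (e.g. 'cu113') where present."""
    tags = {}
    for line in pip_lines:
        # Captures 'torch' and 'cu113' from '[pip3] torch==1.12.1+cu113'.
        match = re.search(r"(\S+)==\S+?\+(cu\d+|cpu)\s*$", line)
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


# Lines copied from the collect_env output above.
lines = [
    "[pip3] torch==1.12.1+cu113",
    "[pip3] torchaudio==0.12.1+cu116",
    "[pip3] torchtext==0.13.1",
    "[pip3] torchvision==0.13.1+cu116",
]

tags = cuda_tags(lines)
print(tags)
# {'torch': 'cu113', 'torchaudio': 'cu116', 'torchvision': 'cu116'}
print("mixed builds" if len(set(tags.values())) > 1 else "consistent builds")
# mixed builds
```

Packages without a local build tag, such as torchtext here, are simply skipped.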
Trying with Preview (Nightly) build of PyTorch with CUDA 11.7
I have also tried installing the Preview (Nightly) build of PyTorch with CUDA 11.7, but it doesn’t seem to work either.
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu117
I terminated the session and queued for a single GPU this time. Output of nvidia-smi:
Sun Sep 25 05:46:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:CA:00.0 Off | 0 |
| 0% 30C P8 32W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Running collect_env:
Collecting environment information...
/scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.13.0.dev20220924+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.6.3
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.0
[pip3] pytorch-forecasting==0.9.0
[pip3] pytorch-lightning==1.6.5
[pip3] pytorch-tabnet==3.0.0
[pip3] pytorch-tabular==0.7.0
[pip3] segmentation-models-pytorch==0.2.1
[pip3] torch==1.13.0.dev20220924+cu117
[pip3] torchaudio==0.12.1+cu116
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.13.1
[pip3] torchvision==0.13.1+cu116
[conda] blas 1.0 mkl
[conda] efficientnet-pytorch 0.6.3 pypi_0 pypi
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-fft 1.3.1 pypi_0 pypi
[conda] mkl-random 1.2.2 pypi_0 pypi
[conda] mkl-service 2.4.0 pypi_0 pypi
[conda] mkl_fft 1.3.1 py310h2b4bcf5_1 conda-forge
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.21.0 pypi_0 pypi
[conda] numpy-base 1.22.3 py310h9585f30_0
[conda] pytorch 1.10.2 cpu_py310h6894f24_0
[conda] pytorch-forecasting 0.9.0 pypi_0 pypi
[conda] pytorch-lightning 1.6.5 pypi_0 pypi
[conda] pytorch-tabnet 3.0.0 pypi_0 pypi
[conda] pytorch-tabular 0.7.0 pypi_0 pypi
[conda] segmentation-models-pytorch 0.2.1 pypi_0 pypi
[conda] torch 1.13.0.dev20220924+cu117 pypi_0 pypi
[conda] torchaudio 0.12.1+cu116 pypi_0 pypi
[conda] torchmetrics 0.7.3 pypi_0 pypi
[conda] torchtext 0.13.1 pypi_0 pypi
[conda] torchvision 0.13.1+cu116 pypi_0 pypi
From the output, it seems to recognize the GPU now, and torch.cuda.device_count() returns 1. However, the warning /scratch/miniconda/conda/envs/jupyterlab/lib/python3.10/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.) remains, and the following calls also contradict that assumption:
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
# device: 'cpu'
torch.cuda.is_available()
# False
torch.cuda.get_device_name(0)
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.
torch.cuda.current_device()
# RuntimeError: CUDA driver initialization failed, you might not have a CUDA GPU.
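Since three different CUDA builds were tried, it may be worth checking whether any of them exceeds what the driver supports. The "CUDA Version" in the nvidia-smi header (11.7 here) is the highest CUDA version the installed driver supports. The sketch below compares wheel tags against it; `parse_cuda_tag` and `driver_supports` are hypothetical helpers, and the parsing assumes a single-digit minor version as in cu113, cu116 and cu117.

```python
def parse_cuda_tag(tag):
    """Turn a wheel tag such as 'cu116' into a (major, minor) tuple.

    Assumption: the last digit is the minor version, the rest is the major.
    """
    digits = tag[2:]  # drop the leading "cu"
    return int(digits[:-1]), int(digits[-1])


def driver_supports(tag, driver_cuda):
    """True if the wheel's CUDA version does not exceed the driver's."""
    major, minor = (int(part) for part in driver_cuda.split("."))
    return parse_cuda_tag(tag) <= (major, minor)


# "11.7" is the CUDA Version reported in the nvidia-smi header above.
for tag in ("cu113", "cu116", "cu117"):
    print(tag, driver_supports(tag, "11.7"))
# cu113 True
# cu116 True
# cu117 True
```

By this check none of the three builds is newer than the driver, which would suggest the failure lies elsewhere, though this comparison alone cannot confirm the cause.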
Your time and effort are very much appreciated.