CUDA initialization: CUDA unknown error

I’m currently unable to use my GPU, as I get this error any time I call a CUDA function:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)

The funny thing is that I don’t recall upgrading my drivers or changing anything related to PyTorch in the last few months; it used to work fine.

I’ve tried reinstalling PyTorch multiple times (through conda) and trying out different CUDA toolkit versions, but it doesn’t seem to help. I’ve also tried rebooting.
Here’s the collect_env output:

Collecting environment information...
/home/nick/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1070 Ti
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] numpy-quaternion==2021.3.17.16.51.43
[pip3] pytorch-lightning==1.3.1
[pip3] pytorch-lightning-bolts==0.3.2
[pip3] torch==1.8.0
[pip3] torchmetrics==0.3.2
[pip3] torchvision==0.9.0
[conda] cudatoolkit               11.0.221             h6bb024c_0    nvidia
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] numpy-quaternion          2021.3.17.16.51.43          pypi_0    pypi
[conda] pytorch-lightning         1.3.1                    pypi_0    pypi
[conda] pytorch-lightning-bolts   0.3.2                    pypi_0    pypi
[conda] torch                     1.8.0                    pypi_0    pypi
[conda] torchmetrics              0.3.2                    pypi_0    pypi
[conda] torchvision               0.9.0                    pypi_0    pypi
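One thing worth checking in the output above: PyTorch was built against CUDA 10.2, the conda environment has cudatoolkit 11.0, and the installed driver is 450.119.03. NVIDIA's release notes publish a minimum Linux driver version per CUDA toolkit, and a mismatch there is a common cause of "CUDA unknown error". Here is a small sketch of that comparison — the minimum-driver values are an illustrative subset taken from NVIDIA's compatibility table, so double-check them against the release notes for your toolkit:

```python
# Rough driver-vs-toolkit compatibility check.
# Minimum Linux driver versions per CUDA toolkit, from NVIDIA's
# CUDA release notes (illustrative subset, not exhaustive).
MIN_DRIVER = {
    "10.2": (440, 33),
    "11.0": (450, 36),
}

def parse_version(v: str) -> tuple:
    # "450.119.03" -> (450, 119, 3)
    return tuple(int(p) for p in v.split("."))

def driver_ok(driver: str, cuda: str) -> bool:
    # True if the installed driver meets the toolkit's minimum.
    return parse_version(driver) >= MIN_DRIVER[cuda]

# Driver version taken from the nvidia-smi output above.
print(driver_ok("450.119.03", "11.0"))  # -> True
print(driver_ok("450.119.03", "10.2"))  # -> True
```

In this case the driver nominally satisfies both toolkits, which points away from a plain version mismatch and toward a broken driver install.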

Also, nvidia-smi and nvtop seem to work fine; here’s the output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:08:00.0  On |                  N/A |
|  5%   49C    P0    38W / 180W |    524MiB /  8113MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1135      G   /usr/lib/xorg/Xorg                 83MiB |
|    0   N/A  N/A      1832      G   /usr/lib/xorg/Xorg                212MiB |
|    0   N/A  N/A      1970      G   /usr/bin/gnome-shell              170MiB |
|    0   N/A  N/A      2156      G   ...wnloads/Telegram/Telegram        5MiB |
|    0   N/A  N/A      2458      G   .../debug.log --shared-files       32MiB |
+-----------------------------------------------------------------------------+

Thanks in advance for any help.

Could Ubuntu have automatically tried to update the drivers, or did you explicitly disable that option? (In the past I ran into similar issues and had to reinstall the driver.)
In case you have a local CUDA toolkit installed, try to compile and run some CUDA samples.
If not, you could use a docker container and execute some smoke tests there using the CUDA samples.
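The container smoke test mentioned above could look something like this — a sketch assuming the NVIDIA Container Toolkit is installed, with an example image tag that may need updating for your setup:

```shell
# Run nvidia-smi inside a CUDA base image to test the driver in
# isolation from the local Python/conda environment.
# Assumes the nvidia-container-toolkit is set up; the image tag
# below is an example and may differ for newer CUDA releases.
if command -v docker >/dev/null 2>&1; then
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi \
        || echo "docker run failed (driver or container toolkit problem?)"
else
    echo "docker not installed; skipping smoke test"
fi
```

If nvidia-smi works inside the container but PyTorch still fails on the host, the problem is more likely in the host environment than in the driver itself.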

Thanks for the tip! I solved the problem by reinstalling the NVIDIA drivers. I suspect some half-initialized update left the system in a sort of tangled state…?

For the record, I ran:

$ sudo apt update
$ sudo apt upgrade
$ sudo apt install nvidia-driver-450