Unable to detect CUDA (CUDA unknown error)

The error message is as follows:

>>> torch.cuda.is_available()
/home/yanjie/anaconda3/envs/sarah/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729047590/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False

This error is really weird. I was able to use CUDA in the morning, but after rebooting, the error showed up. I created a new conda environment but that still did not solve the problem. Based on the error message, I added export CUDA_VISIBLE_DEVICES=1 to my bashrc file, but it didn’t help. I’ve also rebooted countless times.
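As a side note, on a single-GPU machine `export CUDA_VISIBLE_DEVICES=1` actually hides GPU 0 rather than enabling it: CUDA treats the variable as an ordered list of physical device indices and stops enumerating at the first invalid entry. A small sketch of that (simplified) behavior, using a hypothetical helper `visible_devices`:

```python
def visible_devices(cvd: str, num_physical: int) -> list:
    """Mimic (a simplified form of) how CUDA parses CUDA_VISIBLE_DEVICES:
    entries are taken left to right, and the first invalid or out-of-range
    index stops enumeration, hiding it and everything after it."""
    visible = []
    for entry in cvd.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= num_physical:
            break  # invalid entry: this and all later entries are ignored
        visible.append(int(entry))
    return visible

# On a machine with a single GPU (physical index 0):
print(visible_devices("1", num_physical=1))  # [] -> zero visible devices
print(visible_devices("0", num_physical=1))  # [0] -> the GPU is visible
```

So on this system the variable should be 0 (or simply left unset), not 1.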

############################################################################

The output of nvidia-smi is
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   36C    P8    26W / 215W |    276MiB /  7959MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       986      G   /usr/lib/xorg/Xorg                 53MiB |
|    0   N/A  N/A      1637      G   /usr/lib/xorg/Xorg                 80MiB |
|    0   N/A  N/A      1772      G   /usr/bin/gnome-shell              129MiB |
+-----------------------------------------------------------------------------+

############################################################################

The output of nvcc -V is
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

############################################################################

For more information, I ran collect_env.py and the output is as follows:

Collecting environment information…
/home/yanjie/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: False
CUDA runtime version: 11.0.194
GPU models and configuration: GPU 0: GeForce RTX 2070 SUPER
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] numpydoc==1.1.0
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] mypy_extensions 0.4.3 py37_0
[conda] numpy 1.19.1 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h74d4b33_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch 1.10.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.5.0 pypi_0 pypi
[conda] torchaudio 0.10.0 py37_cu102 pytorch
[conda] torchvision 0.11.1 py37_cu102 pytorch

############################################################################

PS: In the morning, when everything still worked, the runtime CUDA was 10.1, the cudatoolkit was 10.2, and the CUDA version shown in nvidia-smi was 11.0. I have no idea why a simple reboot could be so disastrous.

Thanks

A simple reboot shouldn’t cause any issues. However, by default your system might try to update packages on each reboot, and I would guess that Ubuntu tried to update your NVIDIA drivers or CUDA and left them in a broken state.
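One way to check this guess (assuming a standard Ubuntu setup and its usual log/proc paths) is to look for recent NVIDIA package changes and compare them against the kernel module that is actually loaded:

```shell
# Look for recent NVIDIA driver/CUDA package changes in the apt history.
grep -i -A2 nvidia /var/log/apt/history.log || echo "no apt history entries for nvidia"

# List the installed NVIDIA packages and the loaded kernel module version;
# a driver/library version mismatch after an update breaks CUDA initialization.
dpkg -l 2>/dev/null | grep -i nvidia || true
cat /proc/driver/nvidia/version 2>/dev/null || echo "NVIDIA kernel module not loaded"
```

If the apt history shows an nvidia-driver upgrade around the reboot, that is almost certainly the culprit.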

So should I uninstall the NVIDIA driver and reinstall it? Thanks.

Yes, I would reinstall the drivers and disable automatic updates, if not already done.
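One possible reinstall path on Ubuntu is sketched below; the package names and the 470 version are examples, so substitute whatever `ubuntu-drivers` recommends for your GPU:

```shell
# Remove the current (possibly broken) NVIDIA packages.
sudo apt purge 'nvidia-*'

# Install the recommended driver automatically...
sudo ubuntu-drivers autoinstall
# ...or a specific version explicitly:
# sudo apt install nvidia-driver-470

# Pin the driver so unattended upgrades cannot replace it again.
sudo apt-mark hold nvidia-driver-470

sudo reboot
```

Holding the driver package is a lighter-weight alternative to disabling unattended upgrades entirely.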

Really amazing. I uninstalled the NVIDIA driver and reinstalled version 470.74 as recommended, and everything is back. Really appreciate your help!