Unable to detect CUDA (CUDA unknown error)

erwawa · October 26, 2021, 4:01am

The error message is as follows:

>>> torch.cuda.is_available()
/home/yanjie/anaconda3/envs/sarah/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729047590/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False

This error is really weird. I was able to use CUDA this morning, but after rebooting, the error showed up. I created a new environment with conda, but that did not solve the problem. Based on the error message, I added export CUDA_VISIBLE_DEVICES=1 to my .bashrc file, but it didn’t help. I’ve also rebooted countless times.
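A side note on the CUDA_VISIBLE_DEVICES attempt: CUDA device indices start at 0, so on a machine with a single GPU the value 1 points at a device that does not exist and hides the only GPU. The sketch below illustrates that masking rule. The helper name visible_gpu_indices is made up for illustration, and it assumes the variable holds plain integer indices (GPU UUIDs are not handled).

```python
from typing import List, Optional


def visible_gpu_indices(env_value: Optional[str], gpu_count: int) -> List[int]:
    """Return the physical GPU indices CUDA would expose for a given setting.

    Hypothetical helper for illustration; only integer indices are handled.
    """
    if env_value is None:
        # Variable unset: every GPU is visible.
        return list(range(gpu_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= gpu_count:
            # Only entries before the first invalid index stay visible.
            break
        visible.append(int(token))
    return visible


print(visible_gpu_indices("1", gpu_count=1))   # [] -> the only GPU is hidden
print(visible_gpu_indices("0", gpu_count=1))   # [0]
print(visible_gpu_indices(None, gpu_count=1))  # [0]
```

With one GPU, leaving the variable unset or setting it to 0 keeps the device visible.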

############################################################################

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       986      G   /usr/lib/xorg/Xorg                 53MiB |
|    0   N/A  N/A      1637      G   /usr/lib/xorg/Xorg                 80MiB |
|    0   N/A  N/A      1772      G   /usr/bin/gnome-shell              129MiB |
+-----------------------------------------------------------------------------+

############################################################################

The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

############################################################################

For more information, I ran collect_env.py, and the output is as follows:

Collecting environment information…
/home/yanjie/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272126608/work/c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
Is CUDA available: False
CUDA runtime version: 11.0.194
GPU models and configuration: GPU 0: GeForce RTX 2070 SUPER
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.3
[pip3] numpydoc==1.1.0
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py37h7f8727e_0
[conda] mkl_fft 1.3.1 py37hd3c417c_0
[conda] mkl_random 1.2.2 py37h51133e4_0
[conda] mypy_extensions 0.4.3 py37_0
[conda] numpy 1.19.1 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h74d4b33_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch 1.10.0 py3.7_cuda10.2_cudnn7.6.5_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.5.0 pypi_0 pypi
[conda] torchaudio 0.10.0 py37_cu102 pytorch
[conda] torchvision 0.11.1 py37_cu102 pytorch

############################################################################

PS: This morning, when everything was still working, the runtime CUDA version should have been 10.1, the cudatoolkit version 10.2, and the CUDA version shown by nvidia-smi 11.0. I have no idea why a simple reboot could be so disastrous.

Thanks

ptrblck · October 26, 2021, 4:28am

A simple reboot shouldn’t cause any issues. However, by default your system might try to update all packages on each reboot, and I guess that Ubuntu might have tried to update your NVIDIA drivers or CUDA and left them in a broken state.
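One way to check for a half-updated driver is to compare the version of the loaded kernel module with the version that nvidia-smi reports. The following is a minimal sketch, assuming a Linux host where the NVIDIA kernel module exposes /proc/driver/nvidia/version with a line starting with "NVRM version:". The function names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Matches version strings such as 450.119.03 or 470.74.
_VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)?)\b")


def parse_kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the driver version from the 'NVRM version:' line, if present."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = _VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def kernel_module_version(path: str = "/proc/driver/nvidia/version") -> Optional[str]:
    """Version of the currently loaded kernel module, or None if unreadable."""
    try:
        return parse_kernel_module_version(Path(path).read_text())
    except OSError:
        return None


def userspace_driver_version() -> Optional[str]:
    """Version reported by nvidia-smi, or None if the call fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


print("kernel module:", kernel_module_version())
print("nvidia-smi   :", userspace_driver_version())
```

If the kernel module version prints but the nvidia-smi line shows None, or the two versions differ, that would be consistent with an update that replaced the user-space libraries without reloading the kernel module.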

erwawa · October 26, 2021, 4:31am

So should I uninstall the NVIDIA drivers and reinstall them? Thanks.

ptrblck · October 26, 2021, 4:35am

Yes, I would reinstall the drivers and disable automatic updates, if not already done.

erwawa · October 26, 2021, 5:42am

Really amazing. I uninstalled the NVIDIA driver and installed version 470.74 as recommended. After that, everything worked again. I really appreciate your help!

Atharva_Kshirsagar · June 27, 2023, 6:30am

I am getting something similar, except that nvcc is not working for me.

mabl3 · June 24, 2025, 4:32pm

I just want to leave this here because I had the same issue but with a different setup, and kept getting this thread as a search result for my problem. Maybe someone else finds this helpful!

So I got the same error as OP, but while trying to run a program using PyTorch in a Docker container that’s based on an NGC pytorch:<xx.yy>-py3 image. (It was version 22.12, but 25.05 also failed, so I don’t think it has much to do with the version. I also tried different GPU driver versions. Host OS is Ubuntu 24.04, GPU is Quadro RTX 6000.)

Inside the container, nvidia-smi showed the host driver and CUDA version as expected, but in Python, torch.cuda.is_available() returned False, and trying to run the application resulted in the “CUDA unknown error”.
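For anyone debugging the same symptom: torch.cuda.is_available() only emits a warning and returns False, while torch.cuda.init() should raise if initialization fails, which makes the failure easier to capture in a script. A minimal sketch follows. The helper name probe_cuda is made up, and the exact exception type and message depend on the PyTorch build, so the code catches broadly.

```python
def probe_cuda() -> str:
    """Try to initialize CUDA explicitly and describe what happened."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    try:
        # Forces CUDA initialization instead of the lazy, warning-only check.
        torch.cuda.init()
    except Exception as exc:  # exception type varies between builds
        return f"CUDA initialization failed: {type(exc).__name__}: {exc}"
    return f"CUDA initialized, {torch.cuda.device_count()} device(s) visible"


print(probe_cuda())
```

Running this both on the host and inside the container shows whether the failure is specific to the container setup.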

The solution was to add the --privileged flag to the docker run command, i.e.: docker run --privileged --rm --gpus all -it .... Without that flag, some mmap operation fails (I don’t know the details), which leaves PyTorch unable to use the GPU.

Btw, I finally figured that out when I tried to run only the base NGC pytorch container and got way more helpful error messages that way.