Torch.cuda.is_available() returns False, nvidia-smi is working

Same problem here:
I had torch working with CUDA 11.0. Then I upgraded to 11.1 (system update) and now I am getting the exact same error.

“/usr/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0”
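
For reference, a quick way to reproduce this and see which CUDA build PyTorch thinks it has (a minimal check, assuming python3 is the interpreter torch is installed into):

# CUDA version PyTorch was compiled against
python3 -c "import torch; print(torch.version.cuda)"
# Whether the CUDA runtime could be initialized at all
python3 -c "import torch; print(torch.cuda.is_available())"
# How many devices PyTorch can see (0 when initialization failed)
python3 -c "import torch; print(torch.cuda.device_count())"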

Here is some info which might help you figure out the problem:

[kc@kc-manjaro Projects]$ nvcc --version
nvcc: NVIDIA ® Cuda compiler driver
Copyright © 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
[kc@kc-manjaro Projects]$ which nvcc
/opt/cuda/bin/nvcc
[kc@kc-manjaro Projects]$ echo $CUDA_HOME
/opt/cuda
[kc@kc-manjaro Projects]$ echo $LD_LIBRARY_PATH
:/opt/cuda/lib64/:/opt/cuda/lib/:/opt/cuda/extras/CUPTI/lib64

PyTorch version: 1.7.0
Is debug build: Yes
CUDA used to build PyTorch: 11.1
OS: Manjaro Linux
GCC version: (GCC) 10.2.0
CMake version: version 3.18.4
Python version: 3.8
Is CUDA available: No
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2070
Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.0.5
/usr/lib/libcudnn_adv_infer.so.8.0.5
/usr/lib/libcudnn_adv_train.so.8.0.5
/usr/lib/libcudnn_cnn_infer.so.8.0.5
/usr/lib/libcudnn_cnn_train.so.8.0.5
/usr/lib/libcudnn_ops_infer.so.8.0.5
/usr/lib/libcudnn_ops_train.so.8.0.5
/usr/lib/libcudnn_static.a
Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect
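
In case it helps anyone else gather the same report, the summary above comes from PyTorch's bundled environment script (assuming torch is importable from python3):

python3 -m torch.utils.collect_env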

Having the same issue. Any update on how to solve it?
I have CUDA 11.0 and NVIDIA driver 450.80.02.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   22C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
torch.cuda.is_available() is returning False. Any help would be appreciated!

You can see which CUDA versions are compatible with which NVIDIA driver versions here:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html
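
A quick way to compare the relevant versions side by side (a sketch; the nvidia-smi query flags are standard, and python3 is assumed to be the interpreter with torch installed):

# Driver version reported by the kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA toolkit version installed locally
nvcc --version
# CUDA version PyTorch was built against
python3 -c "import torch; print(torch.version.cuda)"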

How can I tell whether there are two NVIDIA drivers installed?
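
One way to check (a sketch for Debian/Ubuntu-style systems; adjust the package query for other distros):

# Kernel modules currently loaded
lsmod | grep nvidia
# Driver packages installed via apt
dpkg -l | grep -i nvidia-driver
# Modules registered with DKMS (e.g. a second driver installed from a runfile)
dkms status

If these report different versions, two driver installs are likely conflicting.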

I have the same problem when my Ubuntu 20.04 machine hibernates. When it wakes up, torch.cuda.is_available() returns False and I have to reboot the system. Is there any other way to avoid rebooting?

I’m also running into this error and can usually reset the device via:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

Alternatively, nvidia-smi --gpu-reset might also work, if the device is not the primary one.
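
A small sketch that bundles the two commands and bails out if something is still using the GPU (fuser and the /dev/nvidia* device nodes are assumptions about a standard Linux install):

# Only reload the UVM module when no process holds a CUDA device open
if sudo fuser -s /dev/nvidia*; then
    echo "GPU is still in use; stop those processes first." >&2
else
    sudo rmmod nvidia_uvm
    sudo modprobe nvidia_uvm
fi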

For an A100 on Ubuntu 20.04, nvidia-fabricmanager needs to be installed after installing CUDA and the driver.

nvidia-smi
# replace the version numbers below with the driver version displayed by nvidia-smi
sudo apt install nvidia-fabricmanager-510 libnvidia-nscq-510
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
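
It is worth confirming the service actually came up before re-testing from Python (nothing A100-specific assumed here beyond the service name above):

systemctl status nvidia-fabricmanager --no-pager
python3 -c "import torch; print(torch.cuda.is_available())"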

Yes, if this problem appears after a reboot, it is usually caused by an update that took place since your last reboot, either to the NVIDIA driver or to the Linux kernel.

For me, it was the Linux kernel. You can check its build date using uname -v. If that date is more recent than your second-to-last reboot, then CUDA was working on the old kernel, and your last reboot switched you to the updated one.
The solution comes from here: What’s the process for fixing NVIDIA drivers after kernel updates in Ubuntu 20.04 - #15 by mar2 - Linux - NVIDIA Developer Forums
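
To make the date comparison concrete, this is the kind of check meant (a sketch; last reboot is the standard way to list recent boots, though its history may be rotated):

# Build date of the currently running kernel
uname -v
# The most recent reboots, for comparison
last reboot | head -n 3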

sudo apt -y install linux-headers-$(uname -r)
If it gives an error and asks you to run apt --fix-broken install, do that instead. Then reboot, and all should be well.
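
Afterwards you can confirm the driver module was rebuilt for the new kernel before re-testing (assuming the driver is managed through DKMS, as the Ubuntu packages do):

# The nvidia module should be listed as installed for the running kernel
dkms status
# The driver should talk to the GPU again
nvidia-smi
# And PyTorch should be able to initialize CUDA
python3 -c "import torch; print(torch.cuda.is_available())"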