This issue has suddenly started appearing whenever I run torch.cuda.is_available():
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729009598/work/c10/cuda/CUDAFunctions.cpp:100.)
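For reference, a minimal reproduction of the symptom (the warning is printed once when CUDA is lazily initialized, and the check then reports that no device is usable):

import torch

# Triggers lazy CUDA initialization; in the broken state the UserWarning above
# is emitted once and the call returns False.
print(torch.cuda.is_available())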
Output of collect_env.py
Collecting environment information…
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
This error is raised when your system cannot communicate with the GPU, which might be caused e.g. by a driver update without a restart, or by another setup issue.
On my personal workstation I see this issue after waking the system from "suspend", as this still seems to cause such problems (after restarting, it works again).
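One rough, hedged way to check whether the driver itself still responds (independently of PyTorch) is to call nvidia-smi; note this is only an indication, since in some broken states nvidia-smi still works even though CUDA initialization fails:

import subprocess

# Query the driver via nvidia-smi; a non-zero return code or an error message
# points at broken driver/GPU communication rather than at PyTorch.
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.returncode)
print(result.stdout or result.stderr)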
Hello, this issue also happens when I wake Ubuntu 22.04 from suspend and run torch.cuda.is_available().
If I reboot, it works again. How can I fix it without rebooting the system?
My GPU is an RTX 3090 with the newest driver (515.43).
Thank you!
Thank you for your reply!
I tried the two commands but they did not work.
If I run torch.cuda.is_available(), CUDA reports the same problem. Maybe it is a bug in the NVIDIA driver's power management?
Yeah, I think it's a known issue in the interaction between "suspend" mode and the driver.
When I have IDEs open, I sometimes get the error rmmod: ERROR: Module nvidia_uvm is in use and cannot reset the GPU(s). In that case I unfortunately have to reboot, but ~9/10 times these two commands do the job and I can use the GPU properly again.
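The two commands are not quoted in this thread, but given the rmmod error above they presumably reload the nvidia_uvm kernel module; a minimal sketch assuming exactly that (requires root privileges):

import subprocess

# Unload the UVM module; this fails with "Module nvidia_uvm is in use" if some
# process still holds a handle on the GPU.
subprocess.run(["sudo", "rmmod", "nvidia_uvm"], check=True)
# Load it again so CUDA can be reinitialized without a reboot.
subprocess.run(["sudo", "modprobe", "nvidia_uvm"], check=True)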
Thank you! I ran into this problem when a program was still running, but the system went to sleep and the program was interrupted. After waking from sleep, torch.cuda.is_available() hit this issue. After running these two commands, it works again.
I, too, have the problem that the kernel module nvidia_uvm cannot be removed because of ERROR: Module nvidia_uvm is in use. Do you know if there's a way to figure out what is using the module? If it's a process, I could probably kill the offending one.
And the error was triggered by having a GPU computation active while putting the system into S3 sleep, so this is definitely related to sleep states.
Yeah, I would have assumed so, too, but it later turned out that the culprit was nvtop which was running in a single terminal window. It causes ERROR: Module nvidia_uvm is in use even when nvidia-smi doesn’t show it at all.
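To the question above about finding out what is using the module: a hedged sketch that scans /proc for processes holding an NVIDIA device node open (the helper name nvidia_device_users is made up for illustration, and you typically need root to inspect other users' processes):

import os

def nvidia_device_users():
    # Walk /proc/<pid>/fd and report processes with /dev/nvidia* open,
    # which is usually what keeps nvidia_uvm "in use" (e.g. nvtop).
    users = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or no permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if target.startswith("/dev/nvidia"):
                with open(f"/proc/{pid}/comm") as f:
                    users[int(pid)] = f.read().strip()
                break
    return users

for pid, name in nvidia_device_users().items():
    print(pid, name)

If a process shows up here but not in nvidia-smi (as with nvtop above), stopping it should let rmmod succeed.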
Note for Lightning: this warning is still triggered even if you want to train explicitly on the CPU with Trainer(accelerator="cpu"), because torch/cuda/__init__.py is still used during trainer.fit. The solution is:
import os

# Hide all GPUs from CUDA. This must be set BEFORE importing torch
# (or any other module that imports torch).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
import lightning

...

trainer = lightning.Trainer(accelerator="cpu")