UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero

This issue has suddenly started appearing whenever I run torch.cuda.is_available().
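For reference, the call that triggers the warning is simply:

import torch
torch.cuda.is_available()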

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729009598/work/c10/cuda/CUDAFunctions.cpp:100.)

Output of collect_env.py

Collecting environment information…
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] geotorch==0.2.0
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.7.0
[pip3] torch-cluster==1.5.8
[pip3] torch-geometric==1.6.3
[pip3] torch-geometric-temporal==0.0.11
[pip3] torch-scatter==2.0.5
[pip3] torch-sparse==0.6.8
[pip3] torch-spline-conv==1.2.0
[pip3] torchaudio==0.7.0a0+ac17b64
[pip3] torchcontrib==0.0.2
[pip3] torchdiffeq==0.2.1
[pip3] torchvision==0.8.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] geotorch 0.2.0 pypi_0 pypi
[conda] mkl 2020.4 h726a3e6_304 conda-forge
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.3.0 py38h5c078b8_1 conda-forge
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] numpydoc 1.1.0 py_1 conda-forge
[conda] pytorch 1.7.0 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torch-cluster 1.5.8 pypi_0 pypi
[conda] torch-geometric 1.6.3 pypi_0 pypi
[conda] torch-geometric-temporal 0.0.11 pypi_0 pypi
[conda] torch-scatter 2.0.5 pypi_0 pypi
[conda] torch-sparse 0.6.8 pypi_0 pypi
[conda] torch-spline-conv 1.2.0 pypi_0 pypi
[conda] torchaudio 0.7.0 py38 pytorch
[conda] torchdiffeq 0.2.1 pypi_0 pypi
[conda] torchvision 0.8.1 py38_cu101 pytorch

Output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Finally, output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03    Driver Version: 450.119.03    CUDA Version: 11.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...   Off | 00000000:04:00.0  On |                  N/A |
| 14%   51C    P5    12W / 250W |    255MiB / 11177MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       933      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      1506      G   /usr/lib/xorg/Xorg                 78MiB |
|    0   N/A  N/A      1632      G   /usr/bin/gnome-shell              126MiB |
|    0   N/A  N/A      3021      G   /usr/lib/firefox/firefox            2MiB |
+-----------------------------------------------------------------------------+

Any help would be appreciated.


This error is raised when your system cannot communicate with the GPU, which might be caused e.g. by a driver update without a restart or another setup issue.
On my personal workstation I see this issue after waking the system from its “suspend” state, which still seems to cause such problems (after restarting it, everything works again).
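If you want to see the underlying failure instead of the warning (which only tells you the device count was forced to zero), a rough diagnostic sketch is to force the initialization yourself; the raised error message is sometimes more specific:

import torch

# Force CUDA initialization so the real driver/runtime error surfaces instead of
# the generic "Setting the available devices to be zero" warning.
try:
    torch.cuda.init()
    print("CUDA initialized, device count:", torch.cuda.device_count())
except RuntimeError as e:
    print("CUDA init failed:", e)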


Thanks for the reply. Unfortunately, restarting my machine doesn’t resolve the issue.


Hello, this issue also happens when I wake Ubuntu 22.04 from suspend and run torch.cuda.is_available().
If I reboot, it works again. How can I fix it without rebooting the system?
My GPU is an RTX 3090 with the newest driver (515.43).
Thank you!


You could try to execute:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

which helps on my Ubuntu system after it was suspended.
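If the modules reload cleanly, verify from a fresh Python process; an interpreter that has already hit the failed initialization will usually keep reporting False, since the broken CUDA state is cached:

import torch

# Run in a new interpreter after reloading nvidia_uvm.
print(torch.cuda.is_available())
print(torch.cuda.device_count())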


Thank you for your reply!
I tried the two commands but they did not work.
If I run torch.cuda.is_available(), CUDA reports the same problem. Maybe it is a bug in the NVIDIA driver’s power management?

Yeah, I think it’s a known issue in the interaction between the “suspend” mode and the driver.
When I have IDEs open, I sometimes get the error rmmod: ERROR: Module nvidia_uvm is in use and cannot reset the GPU(s). In that case I unfortunately have to reboot, but ~9/10 times these two commands do the job and I can use the GPU properly again.


I got the error after trying these two commands countless times, and torch.cuda.is_available() still returns False, so everything falls back to the CPU :confused:

Anyone who still has this issue, try:

sudo apt-get install nvidia-modprobe

worked for me!
Source: RuntimeError: CUDA unknown error · Issue #49081 · pytorch/pytorch · GitHub


Seemed to work for me too!

Thank you! I ran into this problem when a program was still running, but the system went to sleep and the program was interrupted. After the system woke up from sleep, torch.cuda.is_available() ran into this issue. After running these two commands, it works.

I, too, have the problem that the kernel module nvidia_uvm cannot be removed because of the error ERROR: Module nvidia_uvm is in use. Do you know if there’s a way to figure out what is using the module? If it’s a process, I could probably kill the offending process.

And the error was triggered by having GPU computation active while putting the system into S3 sleep, so this is definitely related to sleep states.

You could try running:

lsmod | grep nvidia
lsof | grep nvidia

which should give you information about the processes.
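If lsof is too noisy, here is a rough equivalent sketch in Python (purely illustrative: it scans /proc for open /dev/nvidia* handles, and you may need root to see other users’ processes):

import glob
import os

# Report processes that hold any /dev/nvidia* device file open.
for fd_dir in glob.glob("/proc/[0-9]*/fd"):
    pid = fd_dir.split("/")[2]
    try:
        for fd in os.listdir(fd_dir):
            target = os.readlink(os.path.join(fd_dir, fd))
            if target.startswith("/dev/nvidia"):
                with open(f"/proc/{pid}/comm") as f:
                    print(pid, f.read().strip(), target)
                break
    except OSError:
        continue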

I had the same problem in VS Code just now; I just restarted the conda environment kernel and it works.

I have the same issue as the original post. I used this command and restarted the machine, but it did not solve the problem.

If I’m not wrong, you can see the processes and their PIDs by running nvidia-smi?

(Btw, sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm helped me. A system restart works as well. It is clearly related to hibernation in my case on NixOS.)

Worked for me, thank you

Yeah, I would have assumed so, too, but it later turned out that the culprit was nvtop, which was running in a single terminal window. It causes ERROR: Module nvidia_uvm is in use even when nvidia-smi doesn’t show it at all.


Note for Lightning: this exception occurs even if you want to train explicitly on the CPU with Trainer(accelerator="cpu"), because torch/cuda/__init__.py is still loaded during trainer.fit. The solution is:

import os
# Disable GPU visibility. Make sure this is set BEFORE importing torch (or any other module that imports torch).
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import torch
import lightning
...
trainer = lightning.Trainer(accelerator="cpu")
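As a quick sanity check (assuming the variable really was set before the first torch import):

print(torch.cuda.is_available())  # expected: False, the GPU is hidden so the failing CUDA init is never triggered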

Thanks, this worked for me on Ubuntu 24.04.