This issue has suddenly started appearing whenever I run torch.cuda.is_available():
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729009598/work/c10/cuda/CUDAFunctions.cpp:100.)
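For reference, a minimal reproduction of the symptom (the warning is printed once when CUDA is lazily initialized, and the check then reports that no device is usable):

import torch

# Triggers lazy CUDA initialization; in the broken state the UserWarning above
# is emitted once and the call returns False.
print(torch.cuda.is_available())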
Output of collect_env.py
Collecting environment information…
PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.11.0-25-generic-x86_64-with-glibc2.10
Is CUDA available: False
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
This error is raised when your system cannot communicate with the GPU, which might be caused e.g. by a driver update without a restart, or by another setup issue.
On my personal workstation I see this issue after waking the system from "suspend", as this still seems to cause such problems (after restarting, it works again).
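One rough, hedged way to check whether the driver itself still responds (independently of PyTorch) is to call nvidia-smi; note this is only an indication, since in some broken states nvidia-smi still works even though CUDA initialization fails:

import subprocess

# Query the driver via nvidia-smi; a non-zero return code or an error message
# points at broken driver/GPU communication rather than at PyTorch.
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.returncode)
print(result.stdout or result.stderr)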
Hello, this issue also happens when I wake Ubuntu 22.04 from suspend and run torch.cuda.is_available().
If I reboot, it works again. How can I fix it without rebooting the system?
My GPU is an RTX 3090 with the newest driver (515.43).
Thank you!
Thank you for your reply!
I tried the two commands but they did not work.
If I run torch.cuda.is_available(), CUDA reports the same problem. Maybe it is a bug in the NVIDIA driver's power management?
Yeah, I think it's a known issue in the interaction between "suspend" mode and the driver.
When I have IDEs open, I sometimes get the error rmmod: ERROR: Module nvidia_uvm is in use and cannot reset the GPU(s). In that case I unfortunately have to reboot, but ~9/10 times these two commands do the job and I can use the GPU properly again.
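The two commands are not quoted in this thread, but given the rmmod error above they presumably reload the nvidia_uvm kernel module; a minimal sketch assuming exactly that (requires root privileges):

import subprocess

# Unload the UVM module; this fails with "Module nvidia_uvm is in use" if some
# process still holds a handle on the GPU.
subprocess.run(["sudo", "rmmod", "nvidia_uvm"], check=True)
# Load it again so CUDA can be reinitialized without a reboot.
subprocess.run(["sudo", "modprobe", "nvidia_uvm"], check=True)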
Thank you! I ran into this problem when a program was still running, but the system went to sleep and the program was interrupted. After waking from sleep, torch.cuda.is_available() hit this issue. After running these two commands, it works again.
I, too, have the problem that the kernel module nvidia_uvm cannot be removed because of ERROR: Module nvidia_uvm is in use. Do you know if there's a way to figure out what is using the module? If it's a process, I could probably kill the offending one.
And the error was triggered by having a GPU computation active while putting the system into S3 sleep, so this is definitely related to sleep states.
Yeah, I would have assumed so, too, but it later turned out that the culprit was nvtop which was running in a single terminal window. It causes ERROR: Module nvidia_uvm is in use even when nvidia-smi doesn’t show it at all.
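To the question above about finding out what is using the module: a hedged sketch that scans /proc for processes holding an NVIDIA device node open (the helper name nvidia_device_users is made up for illustration, and you typically need root to inspect other users' processes):

import os

def nvidia_device_users():
    # Walk /proc/<pid>/fd and report processes with /dev/nvidia* open,
    # which is usually what keeps nvidia_uvm "in use" (e.g. nvtop).
    users = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or no permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if target.startswith("/dev/nvidia"):
                with open(f"/proc/{pid}/comm") as f:
                    users[int(pid)] = f.read().strip()
                break
    return users

for pid, name in nvidia_device_users().items():
    print(pid, name)

If a process shows up here but not in nvidia-smi (as with nvtop above), stopping it should let rmmod succeed.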
Note for Lightning: this warning is still triggered even if you want to train explicitly on the CPU with Trainer(accelerator="cpu"), because torch/cuda/__init__.py is still used during trainer.fit. The solution is:
import os

# Hide all GPUs from CUDA. This must be set BEFORE importing torch
# (or any other module that imports torch).
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
import lightning

...

trainer = lightning.Trainer(accelerator="cpu")