No process using GPU, but `CUDA error: all CUDA-capable devices are busy or unavailable`

nkla · November 13, 2020, 4:07am

See latest post for CUDA error: all CUDA-capable devices are busy or unavailable

Resolved issue

I was running PyTorch without issues on a GTX 1080 Ti. I recently obtained an RTX3090 and had to update the NVIDIA drivers for Ampere architecture support. However, I started getting errors when trying to move variables to the GPU with .cuda(), and torch.cuda.is_available() returns False. See below.

The same error also occurs on a separate (new) machine with a Quadro RTX 5000, which leads me to suspect a setup error. However, I do not know what the two machines have in common.

Machines experiencing the same errors

RTX3090
- Debian Testing
- nvidia-driver: 455.38, from Debian experimental
- nvidia-cuda-toolkit: 11.0.3-2, from Debian testing
Quadro RTX5000
- Debian Testing (VM, vfio passthrough)
- nvidia-driver: 450.80, from Debian testing
- nvidia-cuda-toolkit: 11.0.3-2, from Debian testing

Please let me know if you have any suggestions on troubleshooting this issue.

Thanks

The following results are from the RTX3090 machine.

Miniconda env

$ python3 -c 'import torch; print(torch.cuda.is_available())'
/home/user/dev/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False

$ python3 -c 'import torch; torch.rand(3).cuda()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/dev/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Miniconda installation

$ conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch

Pip env

$ python3 -c 'import torch; print(torch.cuda.is_available())'
/home/user/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
False

$ python3 -c 'import torch; torch.rand(3).cuda()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Pip installation

I also tried the torch 1.8 nightly build, with the same error.

$ python3 -m pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

System specs

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux bullseye/sid
Release:	testing
Codename:	bullseye

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    On   | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    33W / 350W |    282MiB / 24245MiB |     16%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

$ apt list --installed | grep nvidia-*
glx-alternative-nvidia/testing,unstable,now 1.2.0 amd64 [installed,automatic]
libegl-nvidia0/experimental,now 455.38-1 amd64 [installed,automatic]
libegl-nvidia0/experimental,now 455.38-1 i386 [installed,automatic]
libgl1-nvidia-glvnd-glx/experimental,now 455.38-1 amd64 [installed,automatic]
libgl1-nvidia-glvnd-glx/experimental,now 455.38-1 i386 [installed,automatic]
libgles-nvidia1/experimental,now 455.38-1 amd64 [installed,automatic]
libgles-nvidia1/experimental,now 455.38-1 i386 [installed,automatic]
libgles-nvidia2/experimental,now 455.38-1 amd64 [installed,automatic]
libgles-nvidia2/experimental,now 455.38-1 i386 [installed,automatic]
libglx-nvidia0/experimental,now 455.38-1 amd64 [installed,automatic]
libglx-nvidia0/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-cfg1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-compiler/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-eglcore/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-eglcore/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-glcore/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-glcore/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-glvkspirv/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-glvkspirv/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-ml-dev/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
libnvidia-ml1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-ptxjitcompiler1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-ptxjitcompiler1/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-alternative/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-cuda-dev/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-cuda-gdb/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-cuda-toolkit-doc/testing,testing,unstable,unstable,now 11.0.3-2 all [installed,automatic]
nvidia-cuda-toolkit/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-driver-bin/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-driver-libs/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-driver-libs/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-driver/experimental,now 455.38-1 amd64 [installed]
nvidia-egl-common/now 455.23.04-1 amd64 [installed,local]
nvidia-egl-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-egl-icd/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-installer-cleanup/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-kernel-common/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-kernel-dkms/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-kernel-support/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-legacy-check/now 455.23.04-1 amd64 [installed,local]
nvidia-modprobe/experimental,now 455.23.04-1 amd64 [installed,automatic]
nvidia-opencl-common/now 455.23.04-1 amd64 [installed,local]
nvidia-opencl-dev/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-opencl-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-openjdk-8-jre/testing,unstable,now 9.+8u252-b09-1~deb9u1~11.0.3-2 amd64 [installed,automatic]
nvidia-persistenced/testing,unstable,now 450.57-1 amd64 [installed]
nvidia-profiler/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
nvidia-settings/testing,unstable,now 450.80.02-1 amd64 [installed]
nvidia-smi/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-support/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-vdpau-driver/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-visual-profiler/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
nvidia-vulkan-common/now 455.23.04-1 amd64 [installed,local]
nvidia-vulkan-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-vulkan-icd/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-xconfig/testing,unstable,now 450.66-1 amd64 [installed]
xserver-xorg-video-nvidia/experimental,now 455.38-1 amd64 [installed]

Testing nvcc

I’m not an expert in CUDA, but I copied a hello-world program, and it compiled and ran without reporting errors.

//hello.cu
// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010
 
#include <stdio.h>
#include <stdlib.h>  // EXIT_SUCCESS
 
const int N = 16; 
const int blocksize = 16; 
 
__global__ 
void hello(char *a, int *b) 
{
	a[threadIdx.x] += b[threadIdx.x];
}
 
int main()
{
	char a[N] = "Hello \0\0\0\0\0\0";
	int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
 
	char *ad;
	int *bd;
	const int csize = N*sizeof(char);
	const int isize = N*sizeof(int);
 
	printf("%s", a);
 
	cudaMalloc( (void**)&ad, csize ); 
	cudaMalloc( (void**)&bd, isize ); 
	cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice ); 
	cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice ); 
	
	dim3 dimBlock( blocksize, 1 );
	dim3 dimGrid( 1, 1 );
	hello<<<dimGrid, dimBlock>>>(ad, bd);
	cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost ); 
	cudaFree( ad );
	cudaFree( bd );
	
	printf("%s\n", a);
	return EXIT_SUCCESS;
}

# nvcc hello.cu -o hello
# ./hello
Hello Hello
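
As a cross-check of the output above, the kernel's arithmetic can be reproduced on the CPU. This is a minimal Python sketch, not part of the original program (the function name `hello_reference` is hypothetical). It adds the same offsets to the same bytes and yields "World!", so a working run would be expected to print "Hello World!". The "Hello Hello" above therefore suggests that the kernel or the memory copies did not take effect, which hello.cu cannot report because it never checks CUDA return codes.

```python
from typing import Sequence

# CPU reference for the hello.cu kernel: a[i] += b[i] for each of the 16 threads.
N = 16


def hello_reference(text: bytes, offsets: Sequence[int]) -> bytes:
    """Add each offset to the matching byte, as the CUDA kernel is meant to."""
    return bytes((c + d) % 256 for c, d in zip(text, offsets))


# Same inputs as hello.cu: "Hello " padded with NUL bytes, and the offset table.
a = b"Hello ".ljust(N, b"\0")
b = [15, 10, 6, 0, -11, 1] + [0] * 10

result = hello_reference(a, b)
expected_second_word = result.rstrip(b"\0").decode("ascii")

# hello.cu prints the original string first, then the transformed one.
print("Hello " + expected_second_word)
```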

Setting envs

Executing the following before importing torch does not resolve the errors:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
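
For reference, a small sketch of how the ordering can be made explicit: the variables are set first, and torch is imported only afterwards inside a helper. The helper name `cuda_report` and the fields it returns are assumptions made for illustration. This did not fix the problem described in this post; it only helps rule out the "changed after program start" cause named in the warning.

```python
import os

# Set these before anything imports torch; CUDA reads them when it initializes.
# setdefault keeps values that were already exported in the shell.
os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")


def cuda_report() -> dict:
    """Collect the values worth pasting into a bug report (hypothetical helper)."""
    report = {
        "CUDA_DEVICE_ORDER": os.environ.get("CUDA_DEVICE_ORDER"),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_version": None,
        "cuda_available": None,
    }
    try:
        import torch  # imported late, after the environment is set
    except ImportError:
        return report  # torch is not installed; leave the torch fields as None
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```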

ptrblck · November 13, 2020, 11:04am

The error

CUDA initialization: CUDA unknown error

is unfortunately not very helpful.
Could you check dmesg for any XID error codes and post them here?
Also, could you check whether docker containers with CUDA 11 and PyTorch work fine on your machine?

granth_jain · November 13, 2020, 4:05pm

Hi,

I had an issue on an RTX 2060 where CUDA was not available.

I reinstalled it and it worked fine. Mine was a dependency issue with TensorFlow, as tensorflow-gpu runs on CUDA 11.

Please make sure that you have installed CUDA correctly, and also check with tensorflow-gpu whether CUDA is running fine.
Hope it helps.
Thanks

nkla · November 13, 2020, 8:11pm

Thanks ptrblck, granth_jain.

When I check dmesg:

# dmesg|grep "NVRM"
[    9.976755] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020

I noticed that if I run the torch methods that error out, I get the following:

[43613.854296] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)

This was caused by an incompatibility between the NVIDIA driver and kernel 5.9. I downgraded the kernel from 5.9 to 5.8 on both computers, and the errors are resolved.
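
A rough sketch of how kernel log text could be scanned for this kind of problem. The "Unknown symbol" pattern is taken from the lines above; the "NVRM: Xid" pattern is the prefix NVIDIA driver fault reports use, and treating these two as the only interesting cases is an assumption. The names `PATTERNS` and `scan_dmesg` are hypothetical.

```python
import re
from typing import List, Tuple

# Patterns to look for; this set is an assumption, not an exhaustive list.
PATTERNS = {
    "unknown_symbol": re.compile(r"nvidia\w*: Unknown symbol (\S+)"),
    "xid": re.compile(r"NVRM: Xid"),
}


def scan_dmesg(text: str) -> List[Tuple[str, str]]:
    """Return (kind, line) pairs for lines that look like NVIDIA driver problems."""
    hits = []
    for line in text.splitlines():
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((kind, line.strip()))
    return hits


# The three log lines quoted in this post.
sample = """\
[    9.976755] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020
[43613.854296] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[43613.854575] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
"""

for kind, line in scan_dmesg(sample):
    print(kind, "->", line)
```

On a real machine the text would come from running dmesg (which may require root) rather than from a pasted sample.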

However, my Quadro RTX 5000 machine is encountering another error:

$ python3 -c 'import torch; torch.randn(1).to(0)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

I verified that no process is using the GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:04:00.0 Off |                  Off |
| 33%   26C    P8     6W / 230W |      1MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and the compute mode is Default, not Exclusive:

$ nvidia-smi -a | grep Compute
    Compute Mode                          : Default
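
The same check can be scripted. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query output with the `name` and `compute_mode` fields. The helper names `parse_query` and `query_compute_mode` are hypothetical, and the function returns None when nvidia-smi is missing or fails.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_query(output: str) -> List[dict]:
    """Turn 'name, compute_mode' CSV rows into dictionaries."""
    rows = []
    for line in output.strip().splitlines():
        name, mode = [field.strip() for field in line.split(",", 1)]
        rows.append({"name": name, "compute_mode": mode})
    return rows


def query_compute_mode() -> Optional[List[dict]]:
    """Ask nvidia-smi for each GPU's compute mode; None if it is absent or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_query(result.stdout) if result.returncode == 0 else None


# Parsing a row shaped like the one this machine would report.
print(parse_query("Quadro RTX 5000, Default\n"))
# Live query; prints None on a machine without nvidia-smi.
print(query_compute_mode())
```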

This is running inside a KVM virtual machine with VFIO passthrough, and I verified that the nvidia driver is attached to the GPU:

04:00.0 VGA compatible controller: NVIDIA Corporation TU104GL [Quadro RTX 5000] (rev a1)
        Subsystem: Dell TU104GL [Quadro RTX 5000]
        Kernel driver in use: nvidia
        Kernel modules: nvidia
05:00.0 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
        Subsystem: Dell TU104 HD Audio Controller
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
06:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
        Subsystem: Dell TU104 USB 3.1 Host Controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
07:00.0 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
        Subsystem: Dell TU104 USB Type-C UCSI Controller
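
To summarize a listing like the one above, a small parser can map each PCI function to the driver bound to it. This is only a sketch: the function name `drivers_in_use` is hypothetical, it assumes the short `bus:device.function` address form shown here (no PCI domain prefix), and the sample is an abbreviated copy of the output above.

```python
import re
from typing import Dict, Optional


def drivers_in_use(lspci_text: str) -> Dict[str, Optional[str]]:
    """Map each PCI address to its 'Kernel driver in use' value, or None if absent."""
    drivers: Dict[str, Optional[str]] = {}
    current = None
    for line in lspci_text.splitlines():
        # Device header lines start at column 0 with an address such as 04:00.0
        header = re.match(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", line)
        if header:
            current = header.group(1)
            drivers[current] = None
        elif current and "Kernel driver in use:" in line:
            drivers[current] = line.split(":", 1)[1].strip()
    return drivers


sample = """\
04:00.0 VGA compatible controller: NVIDIA Corporation TU104GL [Quadro RTX 5000] (rev a1)
        Kernel driver in use: nvidia
05:00.0 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
        Kernel driver in use: snd_hda_intel
06:00.0 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
        Kernel driver in use: xhci_hcd
07:00.0 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
        Subsystem: Dell TU104 USB Type-C UCSI Controller
"""

for address, driver in drivers_in_use(sample).items():
    print(address, driver or "(no driver bound)")
```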

I attempted to:

- Remove all nvidia-* packages and reinstall nvidia-driver (tried both 450.80 and 455.38) and nvidia-cuda-toolkit (11.0)
- Reinstall PyTorch for CUDA 11.0 using miniconda and pip

The server is headless, and no desktop environment was installed. Thus, there should be no graphics-based processes using the gpu.

Do you know what is causing this issue? Could it be caused by VFIO, even though everything seems to be in order? Thanks!

More tests
I downloaded and compiled a script to test CUDA functionality. The output shows error code 201 for cMemGetInfo.

$ ./cuda_check
Found 1 device(s).
Device: 0
  Name: Quadro RTX 5000
  Compute Capability: 7.5
  Multiprocessors: 48
  CUDA Cores: 3072
  Concurrent threads: 49152
  GPU clock: 1815 MHz
  Memory clock: 7001 MHz
  cMemGetInfo failed with error code 201: invalid device context

ptrblck · November 14, 2020, 7:13am

I don’t know whether VFIO could cause this issue. Could you try to run a CUDA sample on this node without VFIO, e.g. in a docker container or on bare metal?

nkla · November 15, 2020, 5:46am

I gave up on VFIO. The original error was caused by my not passing through the other three components of the GPU (audio, USB, and the USB Type-C UCSI controller). Now PyTorch works only sometimes, and only if the GPU was first attached to nouveau and then bound to vfio. For example, if the GPU was only ever used by vfio-pci, PyTorch will not work in the guest. Instead, the python3 process freezes and cannot be killed, requiring a reset of the guest.

It seems VFIO is not ready for deep learning. I wonder how Colab runs its services. I will be running on bare metal now, thanks.

chumingqian · November 16, 2020, 2:51am

Hi,
I have the same problem as you. My machine has an RTX 2060 (notebook series), and I installed CUDA 11.0.3 and cuDNN for CUDA 11.0:

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0

Now I am trying to uninstall CUDA 11 and cuDNN.
Which CUDA version did you reinstall, and before reinstalling, did you uninstall both CUDA 11 and cuDNN?

nkla · November 16, 2020, 3:52am

Hi, are you running kernel>=5.9? You’ll need to downgrade to <=5.8 since nvidia does not support 5.9 yet.
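
A quick way to check this from Python, as a sketch only. The threshold of 5.9 comes from this thread (driver 455.38 in November 2020) and will not hold for newer drivers; the names `FIRST_UNSUPPORTED`, `kernel_major_minor` and `likely_affected` are hypothetical, and the second release string is made up for illustration.

```python
import re
from typing import Tuple

# Assumption from this thread: driver 455.38 fails to load on kernel 5.9 and later.
FIRST_UNSUPPORTED = (5, 9)


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.4.0-52-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def likely_affected(release: str) -> bool:
    """True if the kernel is at or above the first version reported as unsupported."""
    return kernel_major_minor(release) >= FIRST_UNSUPPORTED


# The first string appears later in this thread; the second is illustrative.
for release in ("5.4.0-52-generic", "5.9.0-1-amd64"):
    print(release, "->", "likely affected" if likely_affected(release) else "probably fine")
```

On a live system the release string can be read with platform.release() from the standard library.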

granth_jain · November 16, 2020, 6:09am

Hi,

I don’t remember my previous CUDA version exactly; I installed CUDA version 11.

Installing tensorflow-gpu and then installing PyTorch with the same CUDA version as tensorflow-gpu did the trick for me.

Thanks

chumingqian · November 16, 2020, 8:28am

Hi nkla,
Thanks for your reply. My kernel is 5.4.0-52-generic.
Following the UserWarning:

this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero

I added "export CUDA_VISIBLE_DEVICES=0" to ~/.bashrc (edited with gedit) and sourced it.
Now it works fine:

import torch
torch.cuda.is_available()
True

Thanks again