See latest post for CUDA error: all CUDA-capable devices are busy or unavailable
Resolved issue
I was running PyTorch without issues on a GTX 1080 Ti. I recently obtained an RTX3090 and had to update the NVIDIA drivers for Ampere architecture support. Since then, I get errors when trying to move variables to the GPU with .cuda(), and torch.cuda.is_available() returns False. See below.
The same error also occurs on a separate (new) machine with a Quadro RTX 5000, which leads me to suspect a setup error. However, I do not know what the two machines have in common.
Machines experiencing the same errors
- RTX3090
  - Debian Testing
  - nvidia-driver: 455.38, from Debian experimental
  - nvidia-cuda-toolkit: 11.0.3-2, from Debian testing
- Quadro RTX5000
  - Debian Testing (VM, vfio passthrough)
  - nvidia-driver: 450.80, from Debian testing
  - nvidia-cuda-toolkit: 11.0.3-2, from Debian testing
Please let me know if you have any suggestions on troubleshooting this issue.
Thanks
The following results are from the RTX3090 machine.
Miniconda env
$ python3 -c 'import torch; print(torch.cuda.is_available())'
/home/user/dev/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
$ python3 -c 'import torch; torch.rand(3).cuda()'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/user/dev/miniconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
Miniconda installation
$ conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
Pip env
$ python3 -c 'import torch; print(torch.cuda.is_available())'
/home/user/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
False
$ python3 -c 'import torch; torch.rand(3).cuda()'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/user/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
Pip installation
- Attempted to use torch nightly 1.8, with same error
$ python3 -m pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
System specs
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux bullseye/sid
Release: testing
Codename: bullseye
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 37C P8 33W / 350W | 282MiB / 24245MiB | 16% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
$ apt list --installed | grep nvidia-*
glx-alternative-nvidia/testing,unstable,now 1.2.0 amd64 [installed,automatic]
libegl-nvidia0/experimental,now 455.38-1 amd64 [installed,automatic]
libegl-nvidia0/experimental,now 455.38-1 i386 [installed,automatic]
libgl1-nvidia-glvnd-glx/experimental,now 455.38-1 amd64 [installed,automatic]
libgl1-nvidia-glvnd-glx/experimental,now 455.38-1 i386 [installed,automatic]
libgles-nvidia1/experimental,now 455.38-1 amd64 [installed,automatic]
libgles-nvidia1/experimental,now 455.38-1 i386 [installed,automatic]
libgles-nvidia2/experimental,now 455.38-1 amd64 [installed,automatic]
libgles-nvidia2/experimental,now 455.38-1 i386 [installed,automatic]
libglx-nvidia0/experimental,now 455.38-1 amd64 [installed,automatic]
libglx-nvidia0/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-cfg1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-compiler/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-eglcore/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-eglcore/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-glcore/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-glcore/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-glvkspirv/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-glvkspirv/experimental,now 455.38-1 i386 [installed,automatic]
libnvidia-ml-dev/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
libnvidia-ml1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-ptxjitcompiler1/experimental,now 455.38-1 amd64 [installed,automatic]
libnvidia-ptxjitcompiler1/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-alternative/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-cuda-dev/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-cuda-gdb/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-cuda-toolkit-doc/testing,testing,unstable,unstable,now 11.0.3-2 all [installed,automatic]
nvidia-cuda-toolkit/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-driver-bin/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-driver-libs/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-driver-libs/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-driver/experimental,now 455.38-1 amd64 [installed]
nvidia-egl-common/now 455.23.04-1 amd64 [installed,local]
nvidia-egl-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-egl-icd/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-installer-cleanup/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-kernel-common/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-kernel-dkms/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-kernel-support/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-legacy-check/now 455.23.04-1 amd64 [installed,local]
nvidia-modprobe/experimental,now 455.23.04-1 amd64 [installed,automatic]
nvidia-opencl-common/now 455.23.04-1 amd64 [installed,local]
nvidia-opencl-dev/testing,unstable,now 11.0.3-2 amd64 [installed]
nvidia-opencl-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-openjdk-8-jre/testing,unstable,now 9.+8u252-b09-1~deb9u1~11.0.3-2 amd64 [installed,automatic]
nvidia-persistenced/testing,unstable,now 450.57-1 amd64 [installed]
nvidia-profiler/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
nvidia-settings/testing,unstable,now 450.80.02-1 amd64 [installed]
nvidia-smi/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-support/testing,unstable,now 20151021+12 amd64 [installed]
nvidia-vdpau-driver/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-visual-profiler/testing,unstable,now 11.0.3-2 amd64 [installed,automatic]
nvidia-vulkan-common/now 455.23.04-1 amd64 [installed,local]
nvidia-vulkan-icd/experimental,now 455.38-1 amd64 [installed,automatic]
nvidia-vulkan-icd/experimental,now 455.38-1 i386 [installed,automatic]
nvidia-xconfig/testing,unstable,now 450.66-1 amd64 [installed]
xserver-xorg-video-nvidia/experimental,now 455.38-1 amd64 [installed]
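The package listing above shows several driver-related versions installed side by side (455.38, 455.23.04, 450.80.02, 450.66, 450.57). Whether that mix is related to the error is not established here, but it is easier to see when the packages are grouped by version. The following is a minimal sketch, assuming the `apt list --installed` line format shown above; `group_by_version` is a hypothetical helper name, and the sample lines are copied from the listing.

```python
from collections import defaultdict


def group_by_version(apt_lines):
    """Map upstream version -> sorted package names from `apt list --installed` lines."""
    groups = defaultdict(set)
    for line in apt_lines:
        parts = line.split()
        # Expected shape: "<name>/<suites> <version> <arch> [<status>]"
        if len(parts) < 3 or "/" not in parts[0]:
            continue  # skip headers such as "Listing..."
        name = parts[0].split("/", 1)[0]
        upstream = parts[1].rsplit("-", 1)[0]  # drop the Debian revision suffix
        groups[upstream].add(name)
    return {version: sorted(names) for version, names in groups.items()}


# A few lines copied from the listing above.
sample = [
    "nvidia-driver/experimental,now 455.38-1 amd64 [installed]",
    "nvidia-kernel-dkms/experimental,now 455.38-1 amd64 [installed,automatic]",
    "nvidia-modprobe/experimental,now 455.23.04-1 amd64 [installed,automatic]",
    "nvidia-persistenced/testing,unstable,now 450.57-1 amd64 [installed]",
    "nvidia-settings/testing,unstable,now 450.80.02-1 amd64 [installed]",
]

for version, names in sorted(group_by_version(sample).items()):
    print(version, names)
```

To run it on a full listing, the output of `apt list --installed` could be saved to a file and its lines passed to the function instead of `sample`.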
Testing nvcc
I’m not an expert in CUDA, but I copied a hello-world program, and it compiled and ran without reporting any errors.
//hello.cu
// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010
#include <stdio.h>
const int N = 16;
const int blocksize = 16;
__global__
void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}
int main()
{
    char a[N] = "Hello \0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);
    printf("%s", a);
    cudaMalloc( (void**)&ad, csize );
    cudaMalloc( (void**)&bd, isize );
    cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
    cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );
    dim3 dimBlock( blocksize, 1 );
    dim3 dimGrid( 1, 1 );
    hello<<<dimGrid, dimBlock>>>(ad, bd);
    cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
    cudaFree( ad );
    cudaFree( bd );
    printf("%s\n", a);
    return EXIT_SUCCESS;
}
# nvcc hello.cu -o hello
# ./hello
Hello Hello
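Note that the second word printed above is "Hello", not "World!". The hello.cu sample never checks the return values of cudaMalloc, cudaMemcpy, or the kernel launch, so a clean exit does not by itself show that the kernel ran. The sketch below emulates the kernel's arithmetic on the host in plain Python, to show what the string should become once the offsets are applied; `emulate_hello` is a hypothetical name used only for illustration.

```python
def emulate_hello():
    """Host-side emulation of the hello.cu kernel: a[i] += b[i] for each thread i."""
    n = 16
    a = list(b"Hello ".ljust(n, b"\0"))          # same bytes as char a[N] in hello.cu
    b = [15, 10, 6, 0, -11, 1] + [0] * 10        # same offsets as int b[N]
    result = bytes(x + y for x, y in zip(a, b))
    # printf("%s") stops at the first NUL byte, so cut the string there.
    return result.split(b"\0", 1)[0].decode()


print("Hello " + emulate_hello())
```

If the GPU had applied the offsets, the program would have printed the emulated string instead of repeating "Hello ". This suggests, but does not prove, that the CUDA calls in the sample failed silently, which would be consistent with the PyTorch errors.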
Setting envs
Executing the following prior to importing torch does not resolve the errors:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
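Since these variables make no difference, one way to separate PyTorch from the driver setup is to initialize the CUDA driver API directly and look at the result it returns. The following is a minimal sketch using ctypes, assuming the driver library is installed under its usual Linux name `libcuda.so.1`; `probe_cuda_driver` is a hypothetical helper name, while `cuInit` and `cuGetErrorName` are functions of the CUDA driver API.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly; return (code, name), with code None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    code = lib.cuInit(0)
    if code == 0:
        return 0, "CUDA_SUCCESS"

    # Ask the driver for the symbolic name of the error code.
    name = ctypes.c_char_p()
    lib.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    lib.cuGetErrorName.restype = ctypes.c_int
    if lib.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return code, name.value.decode()
    return code, "unrecognized error code"


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this also reports an error, the problem most likely lies in the driver installation rather than in PyTorch or in how it was installed.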