PyTorch cannot find GPU, 2021 version

Environment:

  • Remote Linux with kernel version 5.8.0. I am not a superuser.
  • Python 3.8.6
  • CUDA Version: 11.1
  • GPU is RTX 3090 with driver version 455.23.05
  • CPU: Intel Core i9-10900K
  • PyTorch version: 1.8.0+cu111
  • System imposed RAM quota: 4GB
  • System imposed number of threads: 512198
  • System imposed RLIMIT_NPROC value: 300

After I run the following code (immediately after entering the python3 command line, so nothing else has run before):

import os
os.environ['OPENBLAS_NUM_THREADS'] = '2'  # limit OpenBLAS threads before importing torch (thread quota on this machine)
import torch
torch.cuda.is_available()   # expected True on a working CUDA setup
torch.cuda.device_count()   # expected 1 for the single RTX 3090

torch.cuda.device_count() returns 0 and torch.cuda.is_available() returns False, with the following additional warning:

/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0

But I can run nvidia-smi and nvcc successfully. nvcc --version reports:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

which means the GPU hardware and CUDA are installed. Why can PyTorch not see the GPU? Is it possible that PyTorch comes in separate GPU and non-GPU builds, and the sysadmin happened to install a non-GPU build? If so, how can I tell whether the installed PyTorch is the GPU or the non-GPU build? Are there any requirements on the CUDA version, such that the installed CUDA 11.1 does not get along well with PyTorch 1.8.0? If so, what version of PyTorch is CUDA 11.1 happy to work with? Thank you for your help.

Depending on how you’ve installed PyTorch, you can pick between the CPU and CUDA runtimes as described in the install instructions.

Check print(torch.version.cuda) as well as python -m torch.utils.collect_env and make sure a CUDA runtime is found.
If a CUDA runtime is found but the GPU is still not detected, your local setup is most likely broken (e.g. via updating an NVIDIA driver without a restart, etc.).
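
For example, a quick check from the Python prompt could look like this (a minimal sketch; the exact values depend on your install):

import torch

print(torch.__version__)          # CUDA builds carry a local version tag, e.g. 1.8.0+cu111
print(torch.version.cuda)         # CUDA runtime the binary ships with; None for CPU-only builds
print(torch.cuda.is_available())  # True only if the driver and runtime actually initialize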

Again, it depends on your setup. If you’ve installed the pip wheels or conda binaries with a CUDA runtime, the local CUDA toolkit won’t be used unless you are building a custom CUDA extension, as the binaries already ship with their own CUDA runtime.

The sysadmin installed PyTorch. I don’t know what parameters he used for installation. Does PyTorch keep a copy of the command line used for installation?

print(torch.version.cuda) returns

11.1

When I ran torch.utils.collect_env inside the python3 command line (so that I could first run the code that sets the OPENBLAS_NUM_THREADS variable), I was told:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'torch.utils' has no attribute 'collect_env'
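
Apparently the collect_env submodule is not imported automatically by import torch, so it has to be imported explicitly. A minimal sketch of what should work from the prompt, assuming main() is the module’s entry point:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '2'  # set before anything loads OpenBLAS

import torch.utils.collect_env
torch.utils.collect_env.main()  # prints the same report as python -m torch.utils.collect_env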

Instead, I downloaded the code of collect_env.py here and ran it (with a line added to set the OPENBLAS_NUM_THREADS variable). The output is as follows:

Collecting environment information…
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.10 (x86_64)
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.32

Python version: 3.8.6 (default, Jan 27 2021, 15:42:20) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-7642-generic-x86_64-with-glibc2.32
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 455.23.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.8.0+cu111
[pip3] torchvision==0.9.0+cu111
[conda] Could not collect

It gives the same error as running torch.cuda.is_available() and torch.cuda.device_count(). Is the CUDA runtime found?

PS: Pop!_OS is a variant of Ubuntu. conda is actually installed on the system, which I detected using the command dpkg -l | grep conda:

ii  conda-package-handling  1.7.0-1  amd64  create and extract conda packages of various formats

Please let me know what I can do on my side for further troubleshooting.

The binaries are using the CUDA 11.1 runtime, so your device should be recognized.
Since you are seeing an error in the __init__ method:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)

this points to a setup issue, so your sysadmin might need to debug the NVIDIA driver and CUDA toolkit installation.

Since it seems you have limited rights on this machine, you could check whether any CUDA samples can be compiled and executed, and you might be able to use a Docker container as a workaround (although that could also fail, given that the driver might not be installed correctly).
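
Another check that doesn’t need elevated rights is to call the CUDA runtime directly from Python via ctypes, bypassing PyTorch entirely. A minimal sketch, assuming libcudart.so is loadable (the exact soname, e.g. libcudart.so.11.0, may differ on your system); if this also reports status 2, the problem sits below PyTorch:

import ctypes

cudart = ctypes.CDLL('libcudart.so')  # adjust the soname/path to your CUDA runtime if needed

count = ctypes.c_int()
status = cudart.cudaGetDeviceCount(ctypes.byref(count))  # returns a cudaError_t code
print('status:', status, 'devices:', count.value)        # status 2 corresponds to 'out of memory'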

Most of the CUDA samples can be compiled and executed, such as 0_Simple/vectorAdd, 6_Advanced/eigenvalues, and 7_CUDALibraries/simpleCUFFT. Some can be compiled but not executed, like 7_CUDALibraries/cuSolverSp_LinearSolver:

CUDA error at cuSolverSp_LinearSolver.cpp:278 code=1(CUSPARSE_STATUS_NOT_INITIALIZED) "cusparseCreate(&cusparseHandle)"

And some can’t be compiled at all, like 7_CUDALibraries/conjugateGradient. Does this mean CUDA is actually set up? Since PyTorch is no different from these CUDA samples in terms of calling the CUDA interface, if all the CUDA APIs PyTorch calls are among those used in the samples I can compile and execute, PyTorch should be good to run. But why can’t PyTorch find any GPU? Is it possible for me to install PyTorch in my user space (or build one from source) so that this PyTorch can see the GPU?

I wouldn’t claim that’s the case, as e.g. an NVIDIA driver update without a restart would leave the machine in an “undefined behavior” state, in my experience; i.e. some applications might work while others crash. However, based on the samples, it seems that at least the toolkit is partially working.

Yes, you could build PyTorch from source, but I would still claim that your setup might be in a flaky state.

I think the GPU driver and CUDA were pre-installed when the server was purchased. Anyway, I have tried my best to solve the problem in a technical way. Now that this has not worked, I’m forced to solve it in a political way.

Did you set the CUDA environment variable CUDA_VISIBLE_DEVICES?
Ref: Programming Guide :: CUDA Toolkit Documentation

CUDA_VISIBLE_DEVICES used to limit visibility

I encountered the same “CUDA initialization” warning on an Ubuntu server. After I set the variable CUDA_VISIBLE_DEVICES, the warning was gone.

In addition, if I do not set the variable, my CUDA program does not run on the GPU either.
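
For reference, the variable has to be set before CUDA is initialized, i.e. before the first CUDA call from torch. A minimal sketch, assuming GPU 0 is the intended device:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # must be set before the first CUDA call

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())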

Hi Zongze, thank you for the hint. I set this environment variable and tried just now, but the problem persists. I now have a small cluster of powerful workstations under my control and everything is running properly, so I am no longer concerned with the remote server managed by a terrible sysadmin. I’m now using TensorFlow 2 because it supports distributed training that can make full use of the cluster. I can’t measure the power of the RTX 3090 in training and inference, but I think my cluster should beat it. It’s a pity that the two beast-like RTX 3090s are kept idle and getting old simply because the sysadmin can’t properly install the driver.

Couldn’t you have installed miniconda in your home directory and then installed pytorch and cudatoolkit inside a conda environment?
You shouldn’t need any root permission to do that.
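
For reference, that would have looked something like the following (a sketch based on the PyTorch install instructions for CUDA 11.1 at the time; the -c nvidia channel supplied the cudatoolkit=11.1 package):

conda create -n torch-env python=3.8
conda activate torch-env
conda install pytorch torchvision cudatoolkit=11.1 -c pytorch -c nvidia

Only the NVIDIA driver would be used from the system; the conda binaries ship with their own CUDA runtime, so no root permissions are needed.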