torch::cuda::is_available() returns false

I have built PyTorch v1.4.0 locally and I am trying the C++ frontend.

I am seeing that the C++ API torch::cuda::is_available() returns false.
With the following snippet:

#include <torch/torch.h>

int main() 
{
    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    a.to(torch::kCUDA);    
    b.to(torch::kCUDA);   
    auto c = a + b;
}

I see the following error:
what(): CUDA error: CUDA driver version is insufficient for CUDA runtime version

However, in Python I don’t see any such issue:

>>> torch.cuda.is_available()
True
>>> a = torch.tensor([1., 2.], device="cuda")
>>> a
tensor([1., 2.], device='cuda:0')

What could be wrong here?

When you call torch::cuda::is_available(), it returns false, right?

Try modifying it like the code below. to() is not an in-place operation; it returns a new tensor, so you need to assign the result back:

    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    a = a.to(torch::kCUDA);    
    b = b.to(torch::kCUDA); 
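
For reference, here is the full snippet with that fix applied, plus an availability check and printout that I’ve added for illustration (they aren’t in the original code):

#include <torch/torch.h>
#include <iostream>

int main()
{
    std::cout << "Is available? " << torch::cuda::is_available() << std::endl;

    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    if (torch::cuda::is_available()) {
        // to() returns a new tensor rather than modifying in place,
        // so the result has to be assigned back.
        a = a.to(torch::kCUDA);
        b = b.to(torch::kCUDA);
    }
    auto c = a + b;
    std::cout << c << std::endl;
}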

Also, check your system. You can get the collect_env.py script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

The CUDA information collected is:

Is CUDA available: Yes
CUDA runtime version: 10.2.89
Nvidia driver version: 440.33.01

I work with @Deepali and have been looking at this as well. I think I see what’s happening now…

The libcuda.so that’s being loaded at run time is the stub version that comes with the CUDA Toolkit. That’s suitable for dynamic linking at build time, but isn’t actually functional.
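
(For context, the libtest binary in the transcripts below isn’t shown in the thread; it’s roughly this minimal sketch, which just prints the device count and availability:)

#include <torch/torch.h>
#include <iostream>

int main()
{
    std::cout << "device_count: " << torch::cuda::device_count() << std::endl;
    std::cout << "Is available? " << torch::cuda::is_available() << std::endl;
}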

$ LD_DEBUG=all ./libtest 2>&1 | grep 'object=.*libcuda[.]so'
     95995:     object=/opt/anaconda3/envs/test_env/lib64/stubs/libcuda.so.1 [0]
     95995:     object=/opt/anaconda3/envs/test_env/lib64/stubs/libcuda.so.1 [0] 

And that library is being found because of the RPATH in the built binary:

$ objdump -p libtest | grep RPATH
  RPATH                [...]:/opt/anaconda3/envs/test_env/lib64/stubs:[...]
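
(As an aside, readelf -d shows the same entry and also distinguishes RPATH from RUNPATH; the distinction matters here, because the dynamic linker only gives the older DT_RPATH priority over LD_LIBRARY_PATH:)

$ readelf -d libtest | grep -E 'RPATH|RUNPATH'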

Unfortunately, RPATH takes precedence over LD_LIBRARY_PATH:

$ LD_LIBRARY_PATH=/usr/lib64 ./libtest
device_count: 0
Is available? 0

But LD_PRELOAD would get around the problem:

$ LD_PRELOAD=/usr/lib64/libcuda.so.440.33.01 ./libtest
device_count: 4
Is available? 1

So I think the trick will be to find a way to prevent the CUDA Toolkit stubs/ library directory from being added to RPATH.
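
As a stopgap, patchelf can rewrite the RPATH of the already-built binary; the path list below is a placeholder for the existing RPATH minus the stubs entry:

$ patchelf --print-rpath libtest
$ patchelf --set-rpath '/existing/rpath/dirs/minus/stubs' libtest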

I’m not sure whether it’s something specific to our environment (we’re building in a conda setup) that causes that, or if it would be something in the general libtorch stuff.

I tried just not installing the CUDA Toolkit development packages (leaving only the runtime installed), but then Torch’s CMake find_package step fails:

CUDA_TOOLKIT_ROOT_DIR not found or specified                                                                                                                                       
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS)

I didn’t look into that yet, but assume it’s probably looking for nvcc to figure out where CUDA’s installed.
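
If so, pointing CMake at the toolkit root explicitly might be worth a try (the path below is a placeholder for wherever CUDA is installed), though with nvcc absent find_package may still complain, since CUDA_NVCC_EXECUTABLE is also listed as missing:

$ cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda ..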

I’ve also asked NVIDIA if they’ll consider updating the stub library’s cuInit() routine to return a well-defined error result (CUDA_STUB_LIBRARY or somesuch). That would make it immediately obvious what’s going on in cases like these.