torch::cuda::is_available() returns false

I have built PyTorch v1.4.0 locally and I am trying the C++ frontend.

I am seeing that the C++ API torch::cuda::is_available() returns false.
With the following snippet:

#include <torch/torch.h>

int main() 
{
    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    a.to(torch::kCUDA);    
    b.to(torch::kCUDA);   
    auto c = a + b;
}

I see the following error:
what(): CUDA error: CUDA driver version is insufficient for CUDA runtime version

However, in Python I don’t see any such issue:

>>> torch.cuda.is_available()
True
>>> a = torch.tensor([1., 2.], device="cuda")
>>> a
tensor([1., 2.], device='cuda:0')

What could be wrong here?

When you call torch::cuda::is_available(), it returns false, right?

Try modifying it like the code below. to() is not an in-place operation; it returns a new tensor, so you need to assign the result back:

    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    a = a.to(torch::kCUDA);    
    b = b.to(torch::kCUDA); 
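
For reference, here is the full snippet with that fix applied, plus an availability check and printout that I’ve added for illustration (they aren’t in the original code):

#include <torch/torch.h>
#include <iostream>

int main()
{
    std::cout << "Is available? " << torch::cuda::is_available() << std::endl;

    torch::Tensor a = torch::ones({ 2, 2 }, torch::requires_grad());
    torch::Tensor b = torch::randn({ 2, 2 });
    if (torch::cuda::is_available()) {
        // to() returns a new tensor rather than modifying in place,
        // so the result has to be assigned back.
        a = a.to(torch::kCUDA);
        b = b.to(torch::kCUDA);
    }
    auto c = a + b;
    std::cout << c << std::endl;
}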

Also, check your system. You can get the collect_env.py script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

The CUDA information collected is:

Is CUDA available: Yes
CUDA runtime version: 10.2.89
Nvidia driver version: 440.33.01

I work with @Deepali and have been looking at this as well. I think I see what’s happening now…

The libcuda.so that’s being loaded at run time is the stub version that comes with the CUDA Toolkit. That’s suitable for dynamic linking at build time, but isn’t actually functional.
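
(For context, the libtest binary in the transcripts below isn’t shown in the thread; it’s roughly this minimal sketch, which just prints the device count and availability:)

#include <torch/torch.h>
#include <iostream>

int main()
{
    std::cout << "device_count: " << torch::cuda::device_count() << std::endl;
    std::cout << "Is available? " << torch::cuda::is_available() << std::endl;
}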

$ LD_DEBUG=all ./libtest 2>&1 | grep 'object=.*libcuda[.]so'
     95995:     object=/opt/anaconda3/envs/test_env/lib64/stubs/libcuda.so.1 [0]
     95995:     object=/opt/anaconda3/envs/test_env/lib64/stubs/libcuda.so.1 [0] 

And that library is being found because of the RPATH in the built binary:

$ objdump -p libtest | grep RPATH
  RPATH                [...]:/opt/anaconda3/envs/test_env/lib64/stubs:[...]
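
(As an aside, readelf -d shows the same entry and also distinguishes RPATH from RUNPATH; the distinction matters here, because the dynamic linker only gives the older DT_RPATH priority over LD_LIBRARY_PATH:)

$ readelf -d libtest | grep -E 'RPATH|RUNPATH'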

Unfortunately, RPATH takes precedence over LD_LIBRARY_PATH:

$ LD_LIBRARY_PATH=/usr/lib64 ./libtest
device_count: 0
Is available? 0

But LD_PRELOAD would get around the problem:

$ LD_PRELOAD=/usr/lib64/libcuda.so.440.33.01 ./libtest
device_count: 4
Is available? 1

So I think the trick will be to find a way to prevent the CUDA Toolkit stubs/ library directory from being added to RPATH.
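
As a stopgap, patchelf can rewrite the RPATH of the already-built binary; the path list below is a placeholder for the existing RPATH minus the stubs entry:

$ patchelf --print-rpath libtest
$ patchelf --set-rpath '/existing/rpath/dirs/minus/stubs' libtest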

I’m not sure whether it’s something specific to our environment (we’re building in a conda setup) that causes that, or if it would be something in the general libtorch stuff.

I tried just not installing the CUDA Toolkit development packages (leaving only the runtime installed), but then Torch’s CMake find_package step fails:

CUDA_TOOLKIT_ROOT_DIR not found or specified                                                                                                                                       
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS)

I didn’t look into that yet, but assume it’s probably looking for nvcc to figure out where CUDA’s installed.
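
If so, pointing CMake at the toolkit root explicitly might be worth a try (the path below is a placeholder for wherever CUDA is installed), though with nvcc absent find_package may still complain, since CUDA_NVCC_EXECUTABLE is also listed as missing:

$ cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda ..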

I’ve also asked NVIDIA if they’ll consider updating the stub library’s cuInit() routine to return a well-defined error result (CUDA_STUB_LIBRARY or somesuch). That would make it immediately obvious what’s going on in cases like these.