How to get debug info from `torch.cuda.is_available()` about missing libs

Hi,

There are already a ton of posts on the topic of "torch is not recognizing the GPU backend".
Each of the solutions offered is case-specific (e.g. upgrade or downgrade CUDA or the driver to some magic number for a specific torch version; see 1, 2, 3, 4, 5, 6 …), and none of those answers seems generic enough to cover most people, so new posts keep appearing with every new torch version.

So, here is what I learned

Currently

$ python -c 'import torch; print(torch.cuda.is_available())'
False

But why? nvidia-smi shows the GPU correctly.

One sure-shot way of diagnosing my CUDA lib compatibility problems (not the desirable one, though!) is asking TensorFlow:

$ python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'

2019-11-18 22:56:28.982050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:00:06.0
2019-11-18 22:56:28.986522: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.989327: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.992141: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.994862: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.997474: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.000329: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.003020: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.003104: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-11-18 22:56:29.003232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-18 22:56:29.003284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-18 22:56:29.003372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
False

TF precisely states Could not load dynamic library <namehere>, e.g. 'libcudnn.so.7' or 'libcublas.so.10.0', which is crucial info for knowing which CUDA libs and versions are missing.

Is there a way to get such debug info from torch? Please let me know (we would be pleased not to depend on TF to fix it). Should torch.cuda.is_available() have a debug=True argument to print which missing libraries are causing it to return False?
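In the meantime, one workaround is to try to dlopen the suspect libraries directly with ctypes, which surfaces the same dlerror text TF prints. A rough sketch; the soname list below is just the ones from the TF log above, and needs adjusting for your CUDA version:

```python
import ctypes

def probe(soname):
    """Try to dlopen a shared library; report the dlerror text if it fails."""
    try:
        ctypes.CDLL(soname)
        return f"{soname}: OK"
    except OSError as e:
        return f"{soname}: MISSING ({e})"

# Sonames taken from the TF warnings above; adjust for your CUDA version.
for lib in ("libcudart.so.10.0", "libcublas.so.10.0", "libcudnn.so.7"):
    print(probe(lib))
```

Running this on a broken setup prints one MISSING line per unresolvable library, which is exactly the info a hypothetical debug=True would give.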


With this, once we figure out that the missing libs and versions are:
Could not load dynamic library 'libcudnn.so.7'; or 'libcublas.so.10.0'
we can do

conda install cudnn=7 cudatoolkit=10.0 -c anaconda

and then torch recognizes the GPU backend for sure:

$ python -c 'import torch; print(torch.cuda.is_available())'
True

But we first have to know that the missing libs are 'libcudnn.so.7' and 'libcublas.so.10.0' in order to install cudnn=7 cudatoolkit=10.0 (otherwise it is down to trial and error and magic numbers).
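That last step could even be automated by mapping each missing soname to a conda spec. A naive sketch; the soname-to-package mapping here is my own assumption (lib prefix dropped, core CUDA runtime libs lumped into cudatoolkit), not anything official:

```python
import re

def soname_to_spec(soname):
    """Map a missing soname like 'libcudnn.so.7' to a conda-style spec.

    Naive mapping: drop the 'lib' prefix, and assume the core CUDA
    runtime libraries are all provided by the cudatoolkit package.
    """
    m = re.match(r"lib(\w+)\.so\.([\d.]+)$", soname)
    if m is None:
        raise ValueError(f"not a versioned soname: {soname!r}")
    name, ver = m.group(1), m.group(2)
    core = {"cudart", "cublas", "cufft", "curand", "cusolver", "cusparse"}
    return f"cudatoolkit={ver}" if name in core else f"{name}={ver}"

print(soname_to_spec("libcudnn.so.7"))      # -> cudnn=7
print(soname_to_spec("libcublas.so.10.0"))  # -> cudatoolkit=10.0
```

So the two missing libs from the TF log translate straight into the `conda install cudnn=7 cudatoolkit=10.0` command above.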

Also, another topic related to getting extra info from debug=True:

$ python -c 'import torch; print(torch.cuda.is_available(), torch.version.cuda)'
False 10.0.130
$ python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
...
E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
...
False

I guess cuInit: CUDA_ERROR_UNKNOWN: unknown error means I have to restart the machine. Knowing such errors would help as well.

Hi,

You can try to actually run a CUDA op to see such errors: torch.rand(1, device="cuda").
Does that give you the information you want?


Yes, that helps. It prints useful info. Thanks.
Wish it were a documented feature!

My code branching has been

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")  # don't attempt to use CUDA at all

so I never had a chance to see that message.
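The branching above could be reworked so the underlying CUDA error gets printed once before falling back to CPU, instead of being silently swallowed. A sketch with the torch calls injected as callables so the pattern is visible on its own; pick_device and its parameters are hypothetical names, not a torch API:

```python
def pick_device(is_available, probe):
    """Return "cuda" if usable; otherwise run probe() once so the real
    CUDA error is printed instead of silently swallowed, then "cpu".

    With torch this would be called as:
        pick_device(torch.cuda.is_available,
                    lambda: torch.rand(1, device="cuda"))
    """
    if is_available():
        return "cuda"
    try:
        probe()  # force the actual driver/library error to surface
    except Exception as e:
        print(f"CUDA unavailable, falling back to cpu: {e}")
    return "cpu"
```

This keeps the usual is_available() fast path but makes the one failure case loud, which is exactly the message I was never seeing.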