Hi,
There are already a ton of posts on the topic of "torch is not recognizing the GPU backend".
Each of the solutions mentioned is case-specific (such as upgrading or downgrading CUDA or the driver to some magic number for a specific torch version; see 1, 2, 3, 4, 5, 6 …), and none of those answers seems generic enough to help most people, so the same question keeps coming back with every new version of torch.
So, here is what I learned. Currently:
$ python -c 'import torch; print(torch.cuda.is_available())'
False
But why? nvidia-smi shows the GPU correctly.
One sure-shot way of tracking down my CUDA library compatibility problems (not a desirable one, though!) is to ask TensorFlow:
$ python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
2019-11-18 22:56:28.982050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:00:06.0
2019-11-18 22:56:28.986522: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.989327: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.992141: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.994862: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:28.997474: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.000329: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.003020: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2019-11-18 22:56:29.003104: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2019-11-18 22:56:29.003232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-18 22:56:29.003284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-11-18 22:56:29.003372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
False
TF states precisely "Could not load dynamic library <namehere>", for example 'libcudnn.so.7' or 'libcublas.so.10.0', which is crucial information for knowing which CUDA libraries and versions are missing.
Is there a way to get such debug info from torch? Please let me know (we would be pleased not to depend on TF to fix it). Should torch.cuda.is_available() have a debug=True argument to print which missing libraries are causing it to return False?
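Until something like that exists, the check TF performs can be roughly approximated without TF. This is only a minimal sketch: it asks the dynamic loader to open each library through Python's standard ctypes module and reports the ones that fail. The list of names is copied from the TF warnings above, and find_missing_libs is a hypothetical helper name, not a torch API. Depending on how torch was installed it may ship its own copies of these libraries, so treat the result as a check of the loader's search path rather than of what torch itself loads.

```python
import ctypes

# Library names copied from the TensorFlow warnings above. They are an
# assumption about what your build needs; adjust the version suffixes.
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def find_missing_libs(names):
    """Return {name: loader error} for every library that cannot be opened."""
    missing = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # asks the dynamic loader to open the library
        except OSError as err:
            missing[name] = str(err)
    return missing


if __name__ == "__main__":
    for name, reason in find_missing_libs(CUDA_LIBS).items():
        print(f"Could not load dynamic library '{name}': {reason}")
```

Run it in the same environment (same conda env, same LD_LIBRARY_PATH) as the failing torch call, otherwise the loader may search different directories.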
With this, if we figure out that the missing libs and versions are 'libcudnn.so.7' and 'libcublas.so.10.0', we can do:
$ conda install cudnn=7 cudatoolkit=10.0 -c anaconda
and then torch recognizes the GPU backend:
$ python -c 'import torch; print(torch.cuda.is_available())'
True
But we first have to know that the missing libs are 'libcudnn.so.7' and 'libcublas.so.10.0' in order to install cudnn=7 cudatoolkit=10.0 (otherwise it comes down to trial and error and magic numbers).
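The step from library names to package versions can also be automated. The sketch below parses the "Could not load dynamic library" lines from a TF log and turns them into conda package specs. The mapping is my assumption: libcudnn is taken to come from the cudnn package and every other library in the log from cudatoolkit. conda_specs is a hypothetical helper, not part of conda or torch.

```python
import re

# Assumed mapping from library stem to the conda package providing it.
# Anything not listed here is assumed to come from "cudatoolkit".
LIB_TO_PACKAGE = {"cudnn": "cudnn"}
DEFAULT_PACKAGE = "cudatoolkit"

# Matches e.g.: Could not load dynamic library 'libcublas.so.10.0'
PATTERN = re.compile(r"Could not load dynamic library 'lib(\w+)\.so\.([\d.]+)'")


def conda_specs(log_text):
    """Return sorted, de-duplicated 'package=version' specs found in the log."""
    specs = set()
    for lib, version in PATTERN.findall(log_text):
        package = LIB_TO_PACKAGE.get(lib, DEFAULT_PACKAGE)
        specs.add(f"{package}={version}")
    return sorted(specs)


if __name__ == "__main__":
    sample = (
        "W ... Could not load dynamic library 'libcublas.so.10.0'; dlerror: ...\n"
        "W ... Could not load dynamic library 'libcudnn.so.7'; dlerror: ...\n"
    )
    print("conda install " + " ".join(conda_specs(sample)) + " -c anaconda")
```

Whether the suggested versions are actually available in your channel still has to be checked by hand.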