I just got torch and CUDA (11.7) set up on my device and am able to verify that cuda.is_available() returns True and that the GPU is being used. However, when I run a script in a Python 3.8.10 virtual env with all the necessary modules installed, I get the following error:
Could not load library libcudnn_cnn_train.so.8. Error: /home/aryaman.pandya/Desktop/gpu_ml/lib/python3.9/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_ops_train.so.8: undefined symbol: _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8
I'm not sure how to troubleshoot further since this is a binary file, and I haven't been able to find solutions online. Would appreciate any help.
Could you describe how you've installed PyTorch and whether you are mixing different installs in your current environment?
Could you also create a new virtual environment and check if reinstalling the PyTorch binaries would solve the issue?
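If it helps, a minimal sanity check in the fresh environment would be something along these lines (just a sketch, assuming the usual torch import and a CUDA-capable GPU); it shows exactly which binaries and cuDNN version the environment actually resolves:

import torch

# Show which install and which CUDA/cuDNN builds this environment picks up
print(torch.__version__, torch.__path__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())

# A tiny convolution exercises cuDNN end to end
x = torch.randn(1, 3, 224, 224, device="cuda")
conv = torch.nn.Conv2d(3, 3, 3).to("cuda")
print(conv(x).sum())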
Thanks for the follow-up. I'm also using the pip wheels with CUDA 11.7 in different environments and did not encounter this issue, so I would need more information about how to reproduce it.
Running that exact code in my venv seems to have pointed me to the root cause. Here's the error message:
>>> torch.__path__
['/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch']
>>> x = torch.randn(1, 3, 224, 224).cuda()
>>> conv = torch.nn.Conv2d(3, 3, 3).cuda()
>>> out = conv(x)
>>> print(out.sum())
tensor(544.2900, device='cuda:0', grad_fn=<SumBackward0>)
>>> torch.backends.cudnn.version()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 68, in version
    if not _init():
  File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 50, in _init
    raise RuntimeError(f'{base_error_msg}'
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 5, 0) but found runtime version (8, 2, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.Looks like your LD_LIBRARY_PATH contains incompatible version of cudnnPlease either remove it from the path or install cudnn (8, 5, 0)
Any idea why my cuDNN version could be wrong? I assumed it was installed with the rest of the CUDA package.
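For what it's worth, one quick way to see whether a system cuDNN reachable through LD_LIBRARY_PATH could be shadowing the cuDNN bundled with the pip wheel is a check like this (a sketch; it only lists candidate files, it doesn't tell you which one the loader ultimately picks):

import glob
import os

# List every libcudnn* the dynamic loader could find via LD_LIBRARY_PATH.
# Files from an older CUDA/cuDNN install here can shadow the cuDNN that
# ships inside the PyTorch pip wheel and cause version-mismatch errors.
for entry in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep):
    if not entry:
        continue
    for lib in sorted(glob.glob(os.path.join(entry, "libcudnn*"))):
        print(lib)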
Thanks so much for the help. I'm still a bit confused, so bear with me… I took a look at LD_LIBRARY_PATH and it's set to /usr/local/cuda-11.7/lib64.
Within this directory there were a bunch of libcudnn* files. Are you suggesting I should remove those binaries?
Edit: I had two paths appended to LD_LIBRARY_PATH, one related to another project. Taking the second one out fixed it. Thanks so much for your help, really appreciate the work you do @ptrblck
Good to hear you've solved the issue!
I don't fully understand why it's failing at all, since we've forced the usage of RPATH (instead of the default RUNPATH) in this PR, so the loader should not search LD_LIBRARY_PATH for another libcudnn*. Let me check why it was failing for you.
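For reference, one way to confirm which libcudnn* files actually end up in the process (a sketch that assumes Linux, since it reads /proc/self/maps) is to trigger a cuDNN call and then inspect the loaded mappings:

import torch

# Force cuDNN to be loaded by running a small convolution on the GPU
x = torch.randn(1, 3, 8, 8, device="cuda")
out = torch.nn.Conv2d(3, 3, 3).cuda()(x)

# /proc/self/maps lists every shared object mapped into this process,
# so it shows exactly which libcudnn* files were resolved at runtime.
with open("/proc/self/maps") as f:
    cudnn_libs = {line.split()[-1] for line in f if "libcudnn" in line}

for path in sorted(cudnn_libs):
    print(path)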