I just got torch and CUDA (11.7) set up on my device and can verify that torch.cuda.is_available() returns True and that the GPU is being used. However, when I run a script in a Python 3.8.10 virtual env with all the necessary modules, I get the following error:
Could not load library libcudnn_cnn_train.so.8. Error: /home/aryaman.pandya/Desktop/gpu_ml/lib/python3.9/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_ops_train.so.8: undefined symbol: _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8
I’m not sure how to troubleshoot further since this is a binary file, and I haven’t been able to find solutions online. I would appreciate any help.
Could you describe how you’ve installed PyTorch and if you are mixing different installs in your current environment?
Could you also create a new virtual environment and check if reinstalling the PyTorch binaries would solve the issue?
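One rough way to spot mixed installs is to list every installed package whose name looks PyTorch- or NVIDIA-related. The sketch below uses only the standard library; the function name `list_distributions` and the prefixes `torch` and `nvidia` are my own assumptions about what is relevant, so adjust them to your setup.

```python
from importlib import metadata  # part of the standard library since Python 3.8


def list_distributions(prefixes=("torch", "nvidia")):
    """Return sorted (name, version) pairs for installed packages whose
    name starts with one of the given prefixes (case-insensitive)."""
    prefixes = tuple(p.lower() for p in prefixes)
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken or missing metadata
        if name.lower().startswith(prefixes):
            found.add((name, str(dist.version)))
    return sorted(found)


if __name__ == "__main__":
    # Run this with the interpreter of the virtual env you want to inspect.
    for name, version in list_distributions():
        print(f"{name}=={version}")
```

If the same package shows up with more than one version, or the listed packages do not match what you meant to install, that would point to a mixed environment.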
Running that exact code in my venv may have helped me find the root cause. Here’s the error message:
>>> x = torch.randn(1, 3, 224, 224).cuda()
>>> conv = torch.nn.Conv2d(3, 3, 3).cuda()
>>> out = conv(x)
tensor(544.2900, device='cuda:0', grad_fn=<SumBackward0>)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 68, in version
if not _init():
File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 50, in _init
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 5, 0) but found runtime version (8, 2, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.Looks like your LD_LIBRARY_PATH contains incompatible version of cudnnPlease either remove it from the path or install cudnn (8, 5, 0)
Any idea why my cuDNN version could be wrong? I assumed it was installed with the rest of the CUDA package.
Thanks so much for the help. I’m still a bit confused, so bear with me… I took a look at the LD_LIBRARY_PATH and it’s set to /usr/local/cuda-11.7/lib64
Within this directory there were a bunch of libcudnn* files. Are you suggesting I should remove those binaries?
Edit: I had two paths appended, one related to another project. Taking the second one out fixed it. Thanks so much for your help, really appreciate the work you do @ptrblck
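For anyone hitting the same error, the manual check above (looking through each LD_LIBRARY_PATH entry for libcudnn* files) can be sketched in a few lines of Python. This is only an illustration: `find_cudnn_dirs` is a name I made up, and it just lists matching file names without checking which cuDNN version they belong to.

```python
import glob
import os


def find_cudnn_dirs(ld_library_path):
    """Map each directory in an LD_LIBRARY_PATH-style string to the sorted
    libcudnn* file names it contains; directories without any are omitted."""
    hits = {}
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # empty entries (e.g. from a trailing separator) are skipped here
        # glob.escape guards against special characters in the directory name
        pattern = os.path.join(glob.escape(entry), "libcudnn*")
        matches = glob.glob(pattern)
        if matches:
            hits[entry] = sorted(os.path.basename(m) for m in matches)
    return hits


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, names in find_cudnn_dirs(current).items():
        print(directory)
        for name in names:
            print("   ", name)
```

Any directory it prints is a candidate for shadowing the cuDNN bundled with PyTorch, so those are the entries to look at first.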
Good to hear you’ve solved the issue!
I don’t fully understand why it failed at all, since in this PR we’ve forced the usage of RPATH (instead of the default RUNPATH), so LD_LIBRARY_PATH should not be searched for another libcudnn*. Let me check why it was failing for you.