Could not load library libcudnn_cnn_train.so.8 while training ConvNet

Aryaman_Pandya · January 29, 2023, 11:31pm

I just got torch and CUDA (11.7) set up on my device and am able to verify that torch.cuda.is_available() returns True and that the GPU is being used. However, when I run a script in a Python 3.8.10 virtual env with all the necessary modules, I get the following error:

Could not load library libcudnn_cnn_train.so.8. Error: /home/aryaman.pandya/Desktop/gpu_ml/lib/python3.9/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_ops_train.so.8: undefined symbol: _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8

I’m not sure how to troubleshoot further since this is a binary file, and I haven’t been able to find solutions online. I would appreciate any help.

ptrblck · January 30, 2023, 12:23am

Could you describe how you’ve installed PyTorch and if you are mixing different installs in your current environment?
Could you also create a new virtual environment and check if reinstalling the PyTorch binaries would solve the issue?

Aryaman_Pandya · January 30, 2023, 12:35am

Hey @ptrblck ! Thanks for the quick response. I actually tried what you described re: making a new venv. This didn’t fix the issue unfortunately.

In terms of installing, I used the pip option here.

OS: Ubuntu 20.04

ptrblck · January 30, 2023, 2:44am

Thanks for the follow-up. I’m also using the pip wheels with CUDA 11.7 in different environments and did not encounter this issue, so I would need more information about how to reproduce it.

ptrblck · January 30, 2023, 3:22am

I’ve created a new virtual environment with Python 3.9 and installed the current 1.13.1+cu117 pip wheels via pip install torch which still works:

>>> import torch
>>> torch.__version__
'1.13.1+cu117'
>>> torch.__path__
['/opt/miniforge3/envs/1.13.1_cu117_py39/lib/python3.9/site-packages/torch']
>>> x = torch.randn(1, 3, 224, 224).cuda()
>>> conv = torch.nn.Conv2d(3, 3, 3).cuda()
>>> out = conv(x)
>>> print(out.sum())
tensor(-6932.8076, device='cuda:0', grad_fn=<SumBackward0>)
>>> torch.backends.cudnn.version()
8500

ls /opt/miniforge3/envs/1.13.1_cu117_py39/lib/python3.9/site-packages/torch/../nvidia/cudnn/lib/libcudnn_ops_train.so.8 
-rw-rw-r-- 1 ptrblck ptrblck 67609960 Jan 29 19:16 /opt/miniforge3/envs/1.13.1_cu117_py39/lib/python3.9/site-packages/torch/../nvidia/cudnn/lib/libcudnn_ops_train.so.8
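
The same check can be done from Python. This is only a minimal sketch: it assumes the pip-wheel layout visible in the paths above (an nvidia/cudnn/lib directory next to the torch package), and the helper name bundled_cudnn_libs is made up for illustration.

```python
from pathlib import Path


def bundled_cudnn_libs(torch_dir):
    """Return the sorted libcudnn* file names shipped next to a torch install.

    Assumes the layout seen in this thread:
    site-packages/torch and site-packages/nvidia/cudnn/lib.
    """
    lib_dir = Path(torch_dir).parent / "nvidia" / "cudnn" / "lib"
    if not lib_dir.is_dir():
        # Layout differs (e.g. another install method) or cuDNN is not bundled.
        return []
    return sorted(p.name for p in lib_dir.glob("libcudnn*"))
```

With torch imported, calling bundled_cudnn_libs(torch.__path__[0]) should list libcudnn_ops_train.so.8 among the results if the bundled cuDNN is in place. An empty list suggests the layout assumption does not hold for that environment.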

Aryaman_Pandya · January 30, 2023, 3:37am

Running that exact code in my venv may have helped me find the root cause. Here’s an error message:

>>> torch.__path__
['/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch']
>>> x = torch.randn(1, 3, 224, 224).cuda()
>>> conv = torch.nn.Conv2d(3, 3, 3).cuda()
>>> out = conv(x)
>>> print(out.sum())
tensor(544.2900, device='cuda:0', grad_fn=<SumBackward0>)
>>> torch.backends.cudnn.version()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 68, in version
    if not _init():
  File "/home/aryaman.pandya/gpu_ml/lib/python3.8/site-packages/torch/backends/cudnn/__init__.py", line 50, in _init
    raise RuntimeError(f'{base_error_msg}'
RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (8, 5, 0) but found runtime version (8, 2, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.Looks like your LD_LIBRARY_PATH contains incompatible version of cudnnPlease either remove it from the path or install cudnn (8, 5, 0)

Any idea why my cuDNN version could be wrong? I assumed it was installed with the rest of the CUDA package.

ptrblck · January 30, 2023, 3:46am

Yes, cuDNN is a dependency and the PyTorch pip wheels will pull it in, as shown during the install steps:

pip install torch
Collecting torch
  Downloading torch-1.13.1-cp39-cp39-manylinux1_x86_64.whl (887.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 58.3 MB/s eta 0:00:00
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 58.5 MB/s eta 0:00:00
Collecting typing-extensions
  Downloading typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 kB 78.2 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 58.4 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 58.3 MB/s eta 0:00:00
...
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 typing-extensions-4.4.0

Check where cuDNN 8.2 is located on your system and, if your LD_LIBRARY_PATH points to it, remove that entry.
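
One way to locate a stray cuDNN is to list every LD_LIBRARY_PATH entry that contains libcudnn* files. The sketch below is an illustration, not an official tool. The function name is hypothetical, and it only looks at LD_LIBRARY_PATH, not at other places the dynamic loader may search.

```python
import os
from pathlib import Path


def find_cudnn_in_search_path(search_path=None):
    """Map each LD_LIBRARY_PATH entry to the libcudnn* files found in it."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = {}
    for entry in search_path.split(":"):
        if not entry:
            continue  # skip empty entries such as a trailing ":"
        directory = Path(entry)
        if not directory.is_dir():
            continue  # entry points to a directory that does not exist
        names = sorted(p.name for p in directory.glob("libcudnn*"))
        if names:
            hits[entry] = names
    return hits


if __name__ == "__main__":
    for entry, names in find_cudnn_in_search_path().items():
        print(entry)
        for name in names:
            print("   ", name)
```

Any directory printed here that is not the one inside your virtual environment is a candidate for the entry to remove.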

Aryaman_Pandya · January 30, 2023, 3:57am

Thanks so much for the help. I’m still a bit confused, so bear with me… I took a look at the LD_LIBRARY_PATH and it’s set to /usr/local/cuda-11.7/lib64
Within this directory there were a bunch of libcudnn* files. Are you suggesting I should remove those binaries?

Edit: I had two paths appended, one related to another project. Taking the second one out fixed it. Thanks so much for your help, really appreciate the work you do @ptrblck
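
For readers in the same situation, taking one entry out of a colon-separated path can be sketched as below. The helper name and the /opt/other_project/lib path are hypothetical placeholders, not the actual paths from this post. As far as I know the dynamic loader reads LD_LIBRARY_PATH when the process starts, so the cleaned value has to be exported in the shell before launching Python rather than changed from inside the running script.

```python
def drop_entries(search_path, unwanted):
    """Return search_path without the unwanted entries, preserving order."""
    # Compare with trailing slashes stripped so "/a/lib/" matches "/a/lib".
    unwanted_norm = {u.rstrip("/") for u in unwanted}
    kept = [
        entry
        for entry in search_path.split(":")
        if entry and entry.rstrip("/") not in unwanted_norm
    ]
    return ":".join(kept)


if __name__ == "__main__":
    # Hypothetical value with a second entry from another project.
    current = "/usr/local/cuda-11.7/lib64:/opt/other_project/lib"
    cleaned = drop_entries(current, ["/opt/other_project/lib"])
    print(f"export LD_LIBRARY_PATH={cleaned}")
    # prints: export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64
```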

ptrblck · January 30, 2023, 4:33am

Good to hear you’ve solved the issue!
I don’t fully understand why it’s failing at all, since we’ve forced the usage of RPATH (instead of the default RUNPATH) in this PR, so the loader should not pick up another libcudnn* from LD_LIBRARY_PATH. Let me check why it was failing for you.
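
To see which libcudnn files a process actually loaded, one option on Linux is to read /proc/self/maps after running a cuDNN-backed op such as the convolution above. This is a rough sketch with hypothetical helper names, and it assumes the usual six-column maps format.

```python
def loaded_libs(maps_text, needle="libcudnn"):
    """Return sorted unique mapped file paths that contain needle."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) == 6 and needle in parts[5]:
            paths.add(parts[5].strip())
    return sorted(paths)


def loaded_libs_in_this_process(needle="libcudnn"):
    """Linux only: inspect the memory mappings of the current process."""
    with open("/proc/self/maps") as f:
        return loaded_libs(f.read(), needle)
```

Calling loaded_libs_in_this_process() in the same interpreter session, after the conv has run on the GPU, shows whether the files come from the site-packages/nvidia/cudnn/lib directory or from somewhere else on the system.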

Bhargav_P_Raj · March 17, 2023, 2:50pm

In which file can I find $LD_LIBRARY_PATH? I have multiple paths set but am not able to remove any.

Bhargav_P_Raj · March 17, 2023, 2:51pm

Also, can you tell me how to take out multiple paths?

Aryaman_Pandya · March 17, 2023, 6:25pm

The paths should be declared in your ~/.bashrc
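
A small sketch for finding the lines that touch the variable is shown below. The list of files checked is an assumption: the variable can also be set elsewhere, for example in system-wide profile scripts or environment activation scripts. The function name is made up for illustration.

```python
from pathlib import Path


def find_assignments(text, variable="LD_LIBRARY_PATH"):
    """Return (line_number, line) pairs for uncommented lines mentioning variable."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore lines that are commented out
        if variable in stripped:
            found.append((number, stripped))
    return found


if __name__ == "__main__":
    # Common per-user startup files; not an exhaustive list.
    for name in (".bashrc", ".profile", ".bash_profile"):
        path = Path.home() / name
        if path.is_file():
            text = path.read_text(errors="replace")
            for number, line in find_assignments(text):
                print(f"{path}:{number}: {line}")
```

Once the line is found, it can be edited or commented out in the file. The change takes effect in a new shell, or after running source ~/.bashrc in the current one. Note that sourcing again in a shell that already has the old value may keep the old entries, so a fresh terminal is the safer check.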

gram · May 25, 2023, 3:31pm

Hi,

I have a similar error. I explained it in the thread “cuDNN version incompatibility - vision - PyTorch Forums” and got a quick response from him. As a beginner, I am not able to understand what he said. In my .bashrc file I found these 3 lines:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/include:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

I don’t know what to do from here. Can you please guide me? Thanks in advance.

lycnight · October 20, 2023, 2:39am

Hello! You can find it by running:

vim ~/.bashrc