Do pytorch containers come with CuDNN installed?

mkserge · April 20, 2021, 12:29am

Hello,

I am running a Docker container based on the official pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime image.

I am also using the onnxruntime-gpu package to serve models from the container. However, importing onnxruntime fails with:

  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/__init__.py", line 24, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py", line 9, in <module>
    import onnxruntime.capi._ld_preload  # noqa: F401
  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/capi/_ld_preload.py", line 13, in <module>
    _libcudnn = CDLL("libcudnn.so.8", mode=RTLD_GLOBAL)
  File "/opt/conda/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudnn.so.8: cannot open shared object file: No such file or directory

Inside the container I see the following library path:

root@fc13d70325fe:/# echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64

but there are no cudnn binaries in there.

Does anyone know what is causing the issue? Do the containers not come with cuDNN pre-installed?

Thank you,

S

ptrblck · April 20, 2021, 5:26am

Based on the naming of the container it seems cudnn is installed and you could check the used version via print(torch.backends.cudnn.version()).
The error seems to be raised by onnxruntime and I don’t know how you’ve built/installed it and what might be the issue.

mkserge · April 20, 2021, 1:38pm

Thanks @ptrblck !

Yeah, I see that

>>> print(torch.backends.cudnn.version())
8003

But I can’t find the libcudnn binary anywhere in the container!

root@fc13d70325fe:/# find / -iname 'libcudnn*'
root@fc13d70325fe:/#

The other dependencies of ONNX runtime are there though

root@fc13d70325fe:/# find / -iname 'libcublas*'
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublasLt.so.11.2.0.252
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublasLt.so
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublas.so.11.2.0.252
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublas.so
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublas.so.11
/opt/conda/pkgs/cudatoolkit-11.0.221-h6bb024c_0/lib/libcublasLt.so.11
/opt/conda/lib/libcublasLt.so.11.2.0.252
/opt/conda/lib/libcublasLt.so
/opt/conda/lib/libcublas.so.11.2.0.252
/opt/conda/lib/libcublas.so
/opt/conda/lib/libcublas.so.11
/opt/conda/lib/libcublasLt.so.11

So, where is the libcudnn binary that pytorch is using?

Edit: So I dug into the source code a bit, and it looks like pytorch has a completely separate implementation of cuDNN inside its own codebase. Is this true?

ptrblck · April 20, 2021, 6:52pm

No, PyTorch uses the official cudnn release and either links it dynamically or statically.

Note that you are using the runtime container, so nvcc isn’t installed either:

root@f79b17da2a55:/workspace# nvcc --version
bash: nvcc: command not found

The lib path from your LD_LIBRARY_PATH doesn’t exist either:

root@f79b17da2a55:/workspace# ls /usr/local/nvidia
ls: cannot access '/usr/local/nvidia': No such file or directory

If you want to build applications inside the container, use the devel container:

root@389363a6c5ec:/workspace# find /usr/ -name libcudnn.so
/usr/lib/x86_64-linux-gnu/libcudnn.so
root@389363a6c5ec:/workspace# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
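
The directory checks above can also be scripted. This is a minimal Python sketch, not something from the replies: the helper name find_libs_on_search_path is hypothetical, and it only inspects the directories listed in a search-path string such as LD_LIBRARY_PATH. It does not consult the loader cache, so a library installed in a default system directory will not show up here.

```python
import glob
import os


def find_libs_on_search_path(pattern, search_path):
    """Map each directory in search_path to the files matching pattern.

    Directories that do not exist map to None, so a missing directory can be
    told apart from an existing directory that has no matches.
    (Hypothetical helper for illustration only.)
    """
    results = {}
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries, e.g. from a trailing ':'
        if not os.path.isdir(directory):
            results[directory] = None
            continue
        results[directory] = sorted(glob.glob(os.path.join(directory, pattern)))
    return results


if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = find_libs_on_search_path("libcudnn*", ld_path)
    for directory, matches in found.items():
        if matches is None:
            print(f"{directory}: directory does not exist")
        elif not matches:
            print(f"{directory}: no matches")
        else:
            print(f"{directory}: {matches}")
```

Returning None for a missing directory keeps the two cases in this thread distinct: a path that is listed but absent, versus a path that exists but holds no cuDNN files.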

mkserge · April 20, 2021, 8:11pm

Thanks @ptrblck , always helpful

But isn’t it odd? The *-runtime package claims to have cudnn installed (and we see it through torch.backends) but it’s not actually there?

I don’t think I am actually compiling anything inside the container (but I could be wrong, maybe onnx does something special). I install onnxruntime-gpu through pip, and it fails during import when it tries to load libcudnn and cannot find it. I myself cannot find cudnn anywhere in the system, so pytorch must be doing something else here, no?

ptrblck · April 20, 2021, 8:35pm

It’s installed in the PyTorch binaries and is most likely linked statically.

This would mean that onnxruntime-gpu doesn’t ship with its own statically linked cudnn, but is trying to dynamically load it from the system installation.

Yes, PyTorch is statically linking it, and the shared library is probably removed afterwards to reduce the image size. If you need the local libs, you would have to use the devel container (or reinstall cudnn into the runtime container).
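
To see the difference in practice, you can ask the dynamic loader directly, which is what the traceback shows onnxruntime doing in _ld_preload.py. This is a rough sketch: the function name can_load_shared_library is made up for illustration, and note that a successful call really does load the library into the current process.

```python
import ctypes


def can_load_shared_library(name):
    """Return (True, "") if the dynamic loader can open `name`.

    On failure return (False, reason), where reason is the loader's message.
    (Hypothetical helper; it mirrors the CDLL call shown in the traceback.)
    """
    try:
        ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    ok, reason = can_load_shared_library("libcudnn.so.8")
    print("libcudnn.so.8 loadable:", ok, reason)
```

If this reports False while torch.backends.cudnn.version() still returns a number, that would be consistent with cuDNN being linked into the PyTorch binaries rather than installed as a shared library on the system.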

mkserge · April 21, 2021, 2:04pm

Alright, makes sense. Thank you @ptrblck !

juharris · June 10, 2021, 6:46pm

Just to help anyone else ending up here searching for solutions, running:

sudo apt install libcudnn8

or whatever version you need, could help you.
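
After installing the library, a quick smoke test of the import can confirm whether the original error is gone. The sketch below is only an assumption about how you might check it. The helper try_import is hypothetical, and it catches OSError as well as ImportError because the failure in this thread surfaced as an OSError raised during import.

```python
import importlib


def try_import(module_name):
    """Try to import a module and return (success, error_message).

    Catches OSError too, since a missing shared library can surface that way
    when a package loads it with ctypes at import time.
    (Hypothetical helper for illustration only.)
    """
    try:
        importlib.import_module(module_name)
    except (ImportError, OSError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    ok, error = try_import("onnxruntime")
    print("onnxruntime import ok:", ok, error)
```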