I am running a Docker container based on the official pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime image.
I am also using the onnxruntime-gpu package to serve models from the container. However, importing onnxruntime fails with:
  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/__init__.py", line 24, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py", line 9, in <module>
    import onnxruntime.capi._ld_preload  # noqa: F401
  File "/home/mrc/.local/lib/python3.8/site-packages/onnxruntime/capi/_ld_preload.py", line 13, in <module>
    _libcudnn = CDLL("libcudnn.so.8", mode=RTLD_GLOBAL)
  File "/opt/conda/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudnn.so.8: cannot open shared object file: No such file or directory
Based on the naming of the container, it seems cuDNN is installed, and you can check the version in use via print(torch.backends.cudnn.version()).
The error seems to be raised by onnxruntime and I don’t know how you’ve built/installed it and what might be the issue.
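For reference, the version check suggested above can be wrapped in a small script. This is only a sketch: the helper name cudnn_report is made up for illustration, and it reports what PyTorch itself sees, which says nothing about whether a separate libcudnn.so.8 exists on disk for other packages.

```python
def cudnn_report():
    """Return what PyTorch itself reports about CUDA and cuDNN."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter; nothing to report.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the binaries were built against (None on CPU-only builds).
        "cuda": torch.version.cuda,
        # Whether this PyTorch build includes cuDNN support.
        "cudnn_available": torch.backends.cudnn.is_available(),
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn_version": torch.backends.cudnn.version(),
    }


if __name__ == "__main__":
    for key, value in cudnn_report().items():
        print(f"{key}: {value}")
```

A non-None cudnn_version here only shows that PyTorch can use cuDNN, not that the dynamic loader can find a standalone shared library.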
So, where is the libcudnn binary that pytorch is using?
Edit: So I dug into the source code a bit, and it looks like PyTorch has a completely separate implementation of cuDNN inside its own codebase. Is this true?
But isn’t it odd? The *-runtime package claims to have cudnn installed (and we see it through torch.backends) but it’s not actually there?
I don’t think I am actually compiling anything inside the container (but I could be wrong; maybe onnx does something special). I install onnxruntime-gpu through pip, and it fails during import when it tries to load libcudnn and cannot find it. I cannot find cuDNN anywhere on the system myself, so PyTorch must be doing something else here, no?
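One way to confirm that no cuDNN shared library is present is to search a few likely directories and ask the system linker. The sketch below is an assumption-laden example: the search roots in DEFAULT_ROOTS and the helper name find_cudnn_files are hypothetical and should be adjusted to the image's actual layout.

```python
import ctypes.util
import os

# Hypothetical search roots; adjust them to the image's actual layout.
DEFAULT_ROOTS = ["/usr/lib", "/usr/local/cuda", "/opt/conda/lib"]


def find_cudnn_files(roots=DEFAULT_ROOTS):
    """Return sorted paths under `roots` whose file name starts with 'libcudnn'."""
    matches = []
    for root in roots:
        # os.walk silently yields nothing for directories that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.startswith("libcudnn"):
                    matches.append(os.path.join(dirpath, name))
    return sorted(set(matches))


if __name__ == "__main__":
    # find_library consults the system's linker tooling and returns None
    # when no matching library is registered.
    print("linker lookup:", ctypes.util.find_library("cudnn"))
    for path in find_cudnn_files():
        print("found:", path)
```

If both the linker lookup and the file search come back empty, the import failure above is expected, regardless of what torch.backends reports.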
It’s installed in the PyTorch binaries and is most likely linked statically.
This would mean that onnxruntime-gpu doesn’t ship with its own statically linked cuDNN, but instead tries to load it dynamically from the system installation.
Yes, it is statically linked, and the shared library is probably removed afterwards to reduce the image size. If you need the local libs, you would have to use the devel container (or reinstall cuDNN into the runtime container).
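After switching to the devel container or reinstalling cuDNN, a quick preflight check can confirm that the loader now finds the library before onnxruntime is imported. This is a minimal sketch: it repeats the CDLL call shown in the traceback, and the helper name can_load is made up for illustration.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        # Same kind of call that fails in the traceback: open the library by name.
        ctypes.CDLL(library_name, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    soname = "libcudnn.so.8"
    if can_load(soname):
        print(f"{soname} is loadable; the onnxruntime import should get past this step")
    else:
        print(f"{soname} is not loadable; install cuDNN or use the devel image")
```

If the library is installed but still not found, the directory that contains it may need to be on the loader's search path, for example via LD_LIBRARY_PATH.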