Compiling OpenCV with CUDA 11.8 breaks PyTorch convolution

System Information

OpenCV version: 4.9.0
Operating System / Platform: Tested with Ubuntu 20.04 and 22.04
Python version: Tested with 3.9 and 3.10

Detailed description

Importing a custom-built OpenCV compiled with CUDA 11.8 in Python breaks PyTorch.
The exact same build configuration, differing only in using CUDA 12.1, does not cause this problem.

Steps to reproduce

CUDA 11.8 OpenCV building Dockerfile: cuda-ffmpeg-opencv-docker/docker-images/cuda-11.8.0/ffmpeg-6.1/opencv-4.9.0/python-3.10/Dockerfile at master · minostauros/cuda-ffmpeg-opencv-docker · GitHub
CUDA 12.1 OpenCV building Dockerfile: cuda-ffmpeg-opencv-docker/docker-images/cuda-12.1.1/ffmpeg-6.1/opencv-4.9.0/python-3.10/Dockerfile at master · minostauros/cuda-ffmpeg-opencv-docker · GitHub

These two Dockerfiles are identical except for the base image and the target CUDA_ARCH_BIN: opencv - Diffchecker

    cmake \
      -D CMAKE_BUILD_TYPE=RELEASE \
      -D BUILD_PYTHON_SUPPORT=ON \
      -D BUILD_DOCS=ON \
      -D BUILD_PERF_TESTS=OFF \
      -D BUILD_TESTS=OFF \
      -D CMAKE_INSTALL_PREFIX=/usr/local \
      -D OPENCV_EXTRA_MODULES_PATH=/opencv_contrib/modules \
      -D BUILD_opencv_python3=$( [ ${PYTHON_VERSION%%.*} -ge 3 ] && echo "ON" || echo "OFF" ) \
      -D BUILD_opencv_python2=$( [ ${PYTHON_VERSION%%.*} -lt 3 ] && echo "ON" || echo "OFF" ) \
      -D PYTHON${PYTHON_VERSION%%.*}_EXECUTABLE=$(which python${PYTHON_VERSION}) \
      -D PYTHON_DEFAULT_EXECUTABLE=$(which python${PYTHON_VERSION}) \
      -D BUILD_EXAMPLES=OFF \
      -D WITH_IPP=OFF \
      -D WITH_FFMPEG=ON \
      -D WITH_GSTREAMER=ON \
      -D WITH_V4L=ON \
      -D WITH_LIBV4L=ON \
      -D WITH_TBB=ON \
      -D WITH_QT=ON \
      -D WITH_OPENGL=ON \
      -D WITH_CUDA=ON \
      -D WITH_LAPACK=ON \
      #-D WITH_HPX=ON \
      -D CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
      -D CMAKE_LIBRARY_PATH=/usr/local/cuda/lib64/stubs \
      # https://kezunlin.me/post/6580691f
      # https://stackoverflow.com/questions/28010399/build-opencv-with-cuda-support
      # https://en.wikipedia.org/wiki/CUDA#GPUs_supported
      # https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
      -D CUDA_ARCH_BIN="3.5 3.7 5.0 5.2 6.0 6.1 7.0 7.5 8.0 8.6 8.7 8.9 9.0" \
      -D CUDA_ARCH_PTX="" \
      -D WITH_CUBLAS=ON \
      -D WITH_NVCUVID=ON \
      -D ENABLE_FAST_MATH=0 \
      -D CUDA_FAST_MATH=0 \
      -D ENABLE_PRECOMPILED_HEADERS=OFF \
      ..

The base FFmpeg images, provided by NVIDIA, are identical except for the CUDA version. ffmpeg - Diffchecker
CUDA 11.8 base image: cuda-ffmpeg-docker/docker-images/ubuntu22.04/cuda-11.8.0/ffmpeg-6.1/Dockerfile at master · minostauros/cuda-ffmpeg-docker · GitHub
CUDA 12.1 base image: cuda-ffmpeg-docker/docker-images/ubuntu22.04/cuda-12.1.1/ffmpeg-6.1/Dockerfile at master · minostauros/cuda-ffmpeg-docker · GitHub

Then, the minimal reproduction is:

docker run --gpus all --rm -ti --ipc=host ghcr.io/minostauros/cuda-ffmpeg-opencv-docker:4.9.0-cu118-py310 bash
pip install torch==2.2.1+cu118 torchvision==0.17.1+cu118 torchaudio==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118

Then, in the container's Python interpreter:
import cv2
import torch

a = torch.nn.Conv1d(100,200,1).cuda()
b = torch.rand(100,100).cuda()
c = a(b)

Result

Segmentation fault (core dumped)

Without import cv2, the code runs without a segmentation fault.
Tested on NVIDIA A100 machines, which are more than capable of running the given code.

I double-checked that changing the PyTorch version to 2.1.2, 2.2.1, or 2.2.2 does not help.

Setting LD_LIBRARY_PATH="" as suggested in the following link did not solve my problem: Forward Pass on Conv2d Segfaults

You could check the loaded libraries via LD_DEBUG=libs to see which library location might change if your OpenCV build is imported. Once isolated, make sure to use compatible versions.

Thanks for the tip!

import cv2 on its own does not reference /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8, but importing cv2 first makes PyTorch reference /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8 instead of /usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib/libcudnn_ops_infer.so.8.
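
The LD_DEBUG=libs check can be scripted; a minimal sketch (a trivial import stands in for the real cv2+torch reproduction, and the "libcudnn" filter is what I grep for in practice):

```python
import os
import subprocess
import sys

# Run a child interpreter with glibc's loader tracing enabled. In the real
# scenario the child command would be "import cv2, torch" and the filter
# pattern would be "libcudnn", to see which copy of the library is resolved.
env = dict(os.environ, LD_DEBUG="libs")
proc = subprocess.run(
    [sys.executable, "-c", "import json"],
    env=env,
    capture_output=True,
    text=True,
)

# The loader writes its trace to stderr; keep only the search/resolve lines.
for line in proc.stderr.splitlines():
    if "find library" in line or "calling init" in line:
        print(line)
```

Comparing the traces with and without `import cv2` first is what exposed the two different libcudnn_ops_infer.so.8 locations above.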

I solved it by

rm /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8
ln -s /usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib/libcudnn_ops_infer.so.8 /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8
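
For reference, the symlink redirection those two commands perform behaves like this (a self-contained sketch with stand-in files; the real paths are the two cuDNN locations above):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Stand-ins for the system-wide copy and the pip-shipped copy of cuDNN.
    system_copy = os.path.join(d, "system", "libcudnn_ops_infer.so.8")
    pip_copy = os.path.join(d, "pip", "libcudnn_ops_infer.so.8")
    os.makedirs(os.path.dirname(system_copy))
    os.makedirs(os.path.dirname(pip_copy))
    for path, version_tag in [(system_copy, "8.9"), (pip_copy, "8.7")]:
        with open(path, "w") as f:
            f.write(version_tag)

    # "rm" the system copy, then "ln -s" the pip copy in its place.
    os.remove(system_copy)
    os.symlink(pip_copy, system_copy)

    # The loader-visible path now resolves to the pip-shipped file.
    resolved = os.path.realpath(system_copy)
    with open(system_copy) as f:
        contents = f.read()

print(resolved)
print(contents)
```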

However, AFAIK it is not recommended to remove files in /lib/x86_64-linux-gnu/ directly.
Do you have any other suggestions?

OK, so the problem is that torch+cu118 was built against cuDNN 8.7, while the NVIDIA CUDA 11.8 image ships cuDNN 8.9.
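
For context, cudnnGetVersion() (which torch.backends.cudnn.version() also reports) returns a single integer; for pre-9.0 releases it is encoded as major*1000 + minor*100 + patch, so the mismatch here shows up as 8700 vs. 89xx. A small decoder, assuming that encoding:

```python
def decode_cudnn_version(v: int) -> tuple:
    """Split a pre-9.0 cuDNN version integer into (major, minor, patch)."""
    return (v // 1000, (v % 1000) // 100, v % 100)

# torch+cu118 ships cuDNN 8.7; the NVIDIA CUDA 11.8 image has 8.9.x.
print(decode_cudnn_version(8700))  # (8, 7, 0)
print(decode_cudnn_version(8902))  # (8, 9, 2)
```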

Removing /lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8 and pointing it at torch's bundled .so instead breaks cv2 this time:

$ ./classify
[ WARN:0@0.598] global init.hpp:32 checkVersions cuDNN reports version 8.7 which is not compatible with the version 8.9 with which OpenCV was built
Could not load library libcudnn_cnn_infer.so.8. Error: /lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8: undefined symbol: _ZN11nvrtcHelper4loadEb, version libcudnn_ops_infer.so.8
Aborted (core dumped)

Is there any way to let cv2 reference cuDNN 8.9 while PyTorch references its own cuDNN?

And this issue has been narrowed down to a PyTorch 2.2 conflict with the locally installed cuDNN: Issue #119989 · pytorch/pytorch · GitHub

However, will the change be applied to older torch versions like 2.1 and 2.2?

Solution

Since cuDNN minor releases are backward compatible, I can simply remove the nvidia-cudnn-cu11==8.7.0.84 package shipped with PyTorch for now, as my system-wide cuDNN is 8.9.
PyTorch then uses the system-wide cuDNN.

pip uninstall -y nvidia-cudnn-cu11
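
A quick way to confirm the pip-shipped cuDNN is gone after the uninstall (the package name matches the cu11 wheels above; cu12 wheels use nvidia-cudnn-cu12 instead):

```python
import importlib.util

# The nvidia-cudnn-cu11 wheel installs an importable "nvidia.cudnn" package
# whose lib/ directory recent PyTorch wheels pick up at import time.
has_pip_cudnn = False
if importlib.util.find_spec("nvidia") is not None:
    has_pip_cudnn = importlib.util.find_spec("nvidia.cudnn") is not None

print("pip-shipped cuDNN present:", has_pip_cudnn)
```

After `pip uninstall -y nvidia-cudnn-cu11` this should print False, and PyTorch falls back to the system-wide library.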

Not sure if this will remain the final solution after the merge of [BE]: Update cudnn to 9.1.0.70 by eqy · Pull Request #123475 · pytorch/pytorch · GitHub
