PyTorch and CUDA version compatibility

I have read in multiple topics: “The PyTorch binaries ship with all CUDA runtime dependencies and you don’t need to locally install a CUDA toolkit or cuDNN. Only a properly installed NVIDIA driver is needed to execute PyTorch workloads on the GPU.”

I have PyTorch 1.13.1+cu117 installed in my Docker container.
My CUDA toolkit version is 11.8.

# python3 -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available()); print(torch.backends.cudnn.version())"
1.13.1+cu117
11.7
True
8500


# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

I had been getting a segmentation fault like the one below.

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long)
1   at::native::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long)
2   at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
3   at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
4   at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
5   at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool)
6   at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool)
7   at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)
8   at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)

----------------------
Error Message Summary:
----------------------
FatalError: Segmentation fault is detected by the operating system.
  [TimeInfo: *** Aborted at 1738625169 (unix time) try "date -d @1738625169" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x200) received by PID 1095 (TID 0x7f5fde39f740) from PID 512 ***]

Segmentation fault (core dumped)

But it disappeared after I uninstalled the existing PyTorch and installed a build compiled against CUDA 11.8:

pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

I am trying to make sense of why a CUDA toolkit version mismatch with PyTorch should affect anything if the toolkit wasn't necessary in the first place. Any help or input would be greatly appreciated, thank you!

Yes, you don’t need to install a CUDA toolkit locally. However, you could check whether PyTorch still tries to open locally installed CUDA or cuDNN libs by running your workload with LD_DEBUG=libs set.
Especially in older PyTorch versions, we used the RUNPATH to load libs, which could end up preferring your local libs.
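To make the LD_DEBUG=libs output easier to scan, here is a minimal sketch that pulls out the "calling init:" lines for a given library family. It assumes the loader output was captured to a file first (LD_DEBUG writes to stderr by default, so something like `LD_DEBUG=libs python3 your_script.py 2> ld_debug.log`, where `your_script.py` and `ld_debug.log` are placeholder names). The helper `resolved_libs` is a hypothetical name, not part of PyTorch.

```python
import re

# Matches loader lines such as:
#   451:  calling init: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
_INIT_LINE = re.compile(r"calling init:\s*(\S+)")


def resolved_libs(ld_debug_text, name_filter="cudnn"):
    """Return paths from 'calling init:' lines that contain name_filter.

    The paths are returned in the order they appear in the log.
    """
    paths = []
    for line in ld_debug_text.splitlines():
        match = _INIT_LINE.search(line)
        if match and name_filter in match.group(1):
            paths.append(match.group(1))
    return paths
```

Reading the captured file and printing `resolved_libs(text)` should show whether the cuDNN libs being initialized come from the torch/lib directory inside the wheel or from a system directory.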

Thank you for the quick response! Below are a couple of things I tried:

This is the section of the LD_DEBUG=libs log right before the segmentation fault in the container with PyTorch 1.13.1+cu117:

       451:	find library=libcudnn_cnn_infer.so.8 [0]; searching
       451:	 search path=/bgarage/onnxruntime-linux-x64-gpu-1.10.0/lib:/bgarage/opencv/opencv/lib		(LD_LIBRARY_PATH)
       451:	  trying file=/bgarage/onnxruntime-linux-x64-gpu-1.10.0/lib/libcudnn_cnn_infer.so.8
       451:	  trying file=/bgarage/opencv/opencv/lib/libcudnn_cnn_infer.so.8
       451:	 search path=/usr/local/lib/python3.8/dist-packages/torch/lib		(RUNPATH from file /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_global_deps.so)
       451:	  trying file=/usr/local/lib/python3.8/dist-packages/torch/lib/libcudnn_cnn_infer.so.8
       451:	
       451:	
       451:	calling init: /usr/local/lib/python3.8/dist-packages/torch/lib/libcudnn_cnn_infer.so.8
       451:	
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
.
.
.

And this is the matching section from PyTorch 2.4.1+cu118:

       379:	find library=libcudnn_cnn_infer.so.8 [0]; searching
       379:	 search path=/bgarage/onnxruntime-linux-x64-gpu-1.10.0/lib:/bgarage/opencv/opencv/lib		(LD_LIBRARY_PATH)
       379:	  trying file=/bgarage/onnxruntime-linux-x64-gpu-1.10.0/lib/libcudnn_cnn_infer.so.8
       379:	  trying file=/bgarage/opencv/opencv/lib/libcudnn_cnn_infer.so.8
       379:	 search path=/usr/lib/x86_64-linux-gnu		(system search path)
       379:	  trying file=/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
       379:	
       379:	
       379:	calling init: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8

I tried the following in the 1.13.1+cu117 container and the error disappeared. I wasn’t able to set RUNPATH, so I modified LD_LIBRARY_PATH instead (not sure whether this is recommended):
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH

Also, why do you think it might be failing on torch’s own cuDNN in the first log?
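The logs above show the loader trying the LD_LIBRARY_PATH entries before the RUNPATH (first log) and before the system search path (second log), which is presumably why prepending /usr/lib/x86_64-linux-gnu/ to LD_LIBRARY_PATH changed which libcudnn_cnn_infer.so.8 got picked up. The sketch below is only a simplified model of that order: it ignores DT_RPATH, the ld.so cache, and everything else the real dynamic loader does, and `first_match` is a hypothetical helper name.

```python
import os


def first_match(lib_name, ld_library_path, runpath_dirs, system_dirs):
    """Return (path, source) for the first directory that contains lib_name.

    Simplified search order: LD_LIBRARY_PATH, then RUNPATH, then system dirs.
    Returns None if the file is not found in any of them.
    """
    search_order = [
        ("LD_LIBRARY_PATH", [d for d in ld_library_path.split(":") if d]),
        ("RUNPATH", list(runpath_dirs)),
        ("system search path", list(system_dirs)),
    ]
    for source, directories in search_order:
        for directory in directories:
            candidate = os.path.join(directory, lib_name)
            if os.path.isfile(candidate):
                return candidate, source
    return None
```

Calling it with the directories from the first log, once with the original LD_LIBRARY_PATH and once with /usr/lib/x86_64-linux-gnu/ prepended, would show the match moving from the RUNPATH directory to the system one, provided the file exists in both places.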

I didn’t think it was failing while loading the cuDNN shipped in the PyTorch wheels; my claim was that torch could dlopen its own libcudnn.so, which could then dlopen sub-libs installed on your system.
Based on the output, that’s not the case, and I don’t remember seeing this error in 1.13.1+cu117 (but it was also released a long time ago).
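One more way to check for a mix of wheel-provided and system-provided cuDNN libs, independent of LD_DEBUG, is to look at what the process has actually mapped. This is a minimal sketch assuming Linux (it parses the text of /proc/self/maps); `mapped_libs` is a hypothetical helper name, not a PyTorch API.

```python
def mapped_libs(maps_text, needle="cudnn"):
    """Return the sorted, de-duplicated mapped file paths whose file name contains needle.

    maps_text is the content of /proc/<pid>/maps, where each line has the form:
        address perms offset dev inode pathname
    Anonymous mappings have no pathname and are skipped.
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in path.rsplit("/", 1)[-1]:
                paths.add(path)
    return sorted(paths)


# Usage (Linux only), after the workload has run a GPU convolution so that
# the cuDNN sub-libs have been loaded:
#
#     with open("/proc/self/maps") as f:
#         for path in mapped_libs(f.read()):
#             print(path)
```

If the list contains cuDNN files from both the torch/lib directory and a system directory such as /usr/lib/x86_64-linux-gnu, the process is mixing libs from two different installs.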