I have read on multiple topics “The PyTorch binaries ship with all CUDA runtime dependencies and you don’t need to locally install a CUDA toolkit or cuDNN. Only a properly installed NVIDIA driver is needed to execute PyTorch workloads on the GPU.”
I have Pytorch 1.13.1+cu117 installed in my docker container.
My CUDA toolkit version is 11.8
# python3 -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available()); print(torch.backends.cudnn.version())"
1.13.1+cu117
11.7
True
8500
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
I had been getting a segmentation fault like below.
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 at::_ops::conv2d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long)
1 at::native::conv2d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long)
2 at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
3 at::_ops::convolution::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
4 at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long)
5 at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool)
6 at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool)
7 at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)
8 at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)
----------------------
Error Message Summary:
----------------------
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1738625169 (unix time) try "date -d @1738625169" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x200) received by PID 1095 (TID 0x7f5fde39f740) from PID 512 ***]
Segmentation fault (core dumped)
But it disappeared after I uninstalled the existing Pytorch and installed a new one compiled with cuda 11.8:
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
I am trying to make sense of why a cuda toolkit version mismatch with Pytorch should affect anything if the toolkit wasnt necessary in the first place. Any help/inputs would be greatly appreciated, thank you !!