Segmentation fault from libtorch_cuda_cpp.so

Hi there. I’m trying to run the Adversarial Example Generation tutorial (PyTorch Tutorials 1.8.1+cu102 documentation). Everything works fine for the first 651 pictures, and then I get a segmentation fault. I checked the memory usage on the GPU (GTX 1050) and it seems fine. I also ran the same code on my friend’s GTX 1050 Ti and it worked fine. I reinstalled Ubuntu and did a clean setup of the driver and CUDA tools, but the problem is still there. I ran the code under the GNU debugger, and here is what I got after 651 pictures:

Thread 12 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffedcdb5700 (LWP 8697)]
0x00007fff32aa46ca in std::_Hashtable<at::native::ConvolutionParams, std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t>, std::allocator<std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t> >, std::__detail::_Select1st, at::native::ParamsEqual<at::native::ConvolutionParams>, at::native::ParamsHash<at::native::ConvolutionParams>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, at::native::ConvolutionParams const&, unsigned long) const ()
   from /home/muco/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so
$nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

$nvidia-smi
Thu May 20 16:12:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P0    N/A /  N/A |   1123MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       849      G   /usr/lib/xorg/Xorg                204MiB |
|    0   N/A  N/A      1332      G   budgie-wm                          23MiB |
|    0   N/A  N/A      1673      G   ...AAAAAAAAA= --shared-files       58MiB |
|    0   N/A  N/A      6801      G   ...AAAAAAAAA= --shared-files       61MiB |
|    0   N/A  N/A      8679      C   /usr/bin/python3                  769MiB |

CuDNN Version: 8.1.0
Ubuntu Version: 20.04
GCC Version: 9.3.0
Python Version: 3.8.5

Is this a hardware problem or some kind of bug?

Were you able to see the backtrace of the segfault?
Also, which PyTorch versions are you using and are you comparing the same versions across machines?
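To make the version comparison concrete, here is a minimal sketch that could be run on both machines and the output diffed. The helper name `collect_versions` is made up for illustration. Note that `torch.version.cuda` reports the CUDA version the installed PyTorch binary was built against, which can differ from what `nvcc --version` prints for the system toolkit.

```python
import platform


def collect_versions():
    """Return a dict of version info relevant to a CUDA/cuDNN crash report.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built with (None for CPU-only builds).
    info["torch_cuda"] = torch.version.cuda
    # cuDNN version PyTorch is using (None if cuDNN is unavailable).
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

If the two machines print different `torch_cuda` or `cudnn` values, the installs are not the same build even though both report PyTorch 1.8.1.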

Full output of gdb:

Reading symbols from python3...
(No debugging symbols found in python3)
Starting program: /usr/bin/python3 adversarial.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9786]
[New Thread 0x7fff09966700 (LWP 9788)]
[New Thread 0x7fff07165700 (LWP 9789)]
[New Thread 0x7fff04964700 (LWP 9790)]
[New Thread 0x7ffef9d6a700 (LWP 9791)]
[New Thread 0x7ffef9569700 (LWP 9792)]
[New Thread 0x7ffef8d68700 (LWP 9793)]
[New Thread 0x7ffeebfff700 (LWP 9794)]
[New Thread 0x7ffeea118700 (LWP 9796)]
[New Thread 0x7ffee919c700 (LWP 9797)]
[New Thread 0x7ffee899b700 (LWP 9798)]
[New Thread 0x7ffedd069700 (LWP 9799)]

Thread 12 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffedd069700 (LWP 9799)]
0x00007fff32aa26ca in std::_Hashtable<at::native::ConvolutionParams, std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t>, std::allocator<std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t> >, std::__detail::_Select1st, at::native::ParamsEqual<at::native::ConvolutionParams>, at::native::ParamsHash<at::native::ConvolutionParams>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, at::native::ConvolutionParams const&, unsigned long) const () from /home/muco/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so

Yes, PyTorch 1.8.1 is installed on both machines.

The same code runs fine on the CPU on my PC.