Segmentation fault from libtorch_cuda_cpp.so

mucozcan · May 20, 2021, 1:20pm

Hi there. I’m trying to run the Adversarial Example Generation tutorial (PyTorch Tutorials 1.8.1+cu102 documentation). Everything works fine for the first 651 pictures, and then I get a segmentation fault. I checked the memory usage on the GPU (GTX 1050) and it seems fine. I also ran the same code on my friend’s GTX 1050 Ti and it worked fine. I reinstalled Ubuntu and did a clean setup of the driver and CUDA tools, but the problem is still there. I ran the code under the GNU debugger (gdb), and here is what I got after 651 pictures:

Thread 12 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffedcdb5700 (LWP 8697)]
0x00007fff32aa46ca in std::_Hashtable<at::native::ConvolutionParams, std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t>, std::allocator<std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t> >, std::__detail::_Select1st, at::native::ParamsEqual<at::native::ConvolutionParams>, at::native::ParamsHash<at::native::ConvolutionParams>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, at::native::ConvolutionParams const&, unsigned long) const ()
   from /home/muco/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so

$nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0

$nvidia-smi
Thu May 20 16:12:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P0    N/A /  N/A |   1123MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       849      G   /usr/lib/xorg/Xorg                204MiB |
|    0   N/A  N/A      1332      G   budgie-wm                          23MiB |
|    0   N/A  N/A      1673      G   ...AAAAAAAAA= --shared-files       58MiB |
|    0   N/A  N/A      6801      G   ...AAAAAAAAA= --shared-files       61MiB |
|    0   N/A  N/A      8679      C   /usr/bin/python3                  769MiB |

CuDNN Version: 8.1.0
Ubuntu Version: 20.04
GCC Version: 9.3.0
Python Version: 3.8.5

I’m wondering whether this is a hardware problem or some kind of bug.

ptrblck · May 21, 2021, 9:03am

Were you able to see the backtrace of the segfault?
Also, which PyTorch versions are you using and are you comparing the same versions across machines?
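For comparing versions across machines, a minimal sketch like the one below could be run on both PCs and the output diffed. The function name `collect_versions` is hypothetical. Note that the versions PyTorch reports are the ones it was built with, and these may differ from the locally installed CUDA toolkit shown by `nvcc --version`, since the pip binaries usually ship their own CUDA runtime and cuDNN.

```python
import importlib
import platform


def collect_versions():
    """Return version info relevant to a PyTorch/CUDA bug report (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
        "gpu": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment; report what we have.
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    if torch.backends.cudnn.is_available():
        # cuDNN version seen by PyTorch, as an integer.
        info["cudnn"] = torch.backends.cudnn.version()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for name, value in collect_versions().items():
    print(f"{name}: {value}")
```

PyTorch also ships a more complete report via `python -m torch.utils.collect_env`, which may be the easier thing to paste into a thread like this one.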

mucozcan · May 21, 2021, 9:17am

Full output of gdb:

Reading symbols from python3...
(No debugging symbols found in python3)
Starting program: /usr/bin/python3 adversarial.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 9786]
[New Thread 0x7fff09966700 (LWP 9788)]
[New Thread 0x7fff07165700 (LWP 9789)]
[New Thread 0x7fff04964700 (LWP 9790)]
[New Thread 0x7ffef9d6a700 (LWP 9791)]
[New Thread 0x7ffef9569700 (LWP 9792)]
[New Thread 0x7ffef8d68700 (LWP 9793)]
[New Thread 0x7ffeebfff700 (LWP 9794)]
[New Thread 0x7ffeea118700 (LWP 9796)]
[New Thread 0x7ffee919c700 (LWP 9797)]
[New Thread 0x7ffee899b700 (LWP 9798)]
[New Thread 0x7ffedd069700 (LWP 9799)]

Thread 12 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffedd069700 (LWP 9799)]
0x00007fff32aa26ca in std::_Hashtable<at::native::ConvolutionParams, std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t>, std::allocator<std::pair<at::native::ConvolutionParams const, cudnnConvolutionBwdDataAlgoPerf_t> >, std::__detail::_Select1st, at::native::ParamsEqual<at::native::ConvolutionParams>, at::native::ParamsHash<at::native::ConvolutionParams>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, at::native::ConvolutionParams const&, unsigned long) const () from /home/muco/.local/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so

Yes, PyTorch 1.8.1 is installed on both machines.

The same code runs fine on the CPU on my PC.
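
The gdb output above shows only the faulting frame. Typing `bt` at the gdb prompt after the crash prints the full native backtrace. As a complement, the standard-library `faulthandler` module can record which Python line each thread was executing when the signal arrived. The sketch below is one way to set that up; the helper name `enable_crash_tracebacks` and the log file name are assumptions, not part of the tutorial.

```python
import faulthandler


def enable_crash_tracebacks(log_path="segfault_traceback.log"):
    """Write Python tracebacks of all threads to log_path on SIGSEGV and
    other fatal signals (hypothetical helper)."""
    # faulthandler writes to the file descriptor directly, so the file must
    # stay open for the lifetime of the process; return it to the caller.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `log_file = enable_crash_tracebacks()` at the top of `adversarial.py`, before the test loop starts, should leave a Python-level traceback in the log file if the segfault occurs again. The same effect is available without code changes by running `python3 -X faulthandler adversarial.py`, which prints the traceback to stderr.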