Segmentation fault problem

Hi.
I have a segmentation fault problem while training with 4 GPUs.
The PyTorch version is 1.4.0, and I use nn.DataParallel to run on the 4 GPUs.
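
For context, the training setup is roughly like the sketch below; the model, loss function, and tensor shapes are placeholders, not the actual project code:

import torch
import torch.nn as nn

# Illustrative model; the real project uses its own modules.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    inputs = torch.randn(32, 128).cuda()           # dummy batch
    targets = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()                                # the crash reported below happens inside this call
    optimizer.step()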

Fatal Python error: Segmentation fault


Thread 0x00007fdf3134e740 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99 in backward
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195 in backward
  File "/root/project/v2/../libs/train_with_ft.py", line 82 in train
  File "/root/project/v2/../libs/train_with_ft.py", line 147 in main
  File "/root/project/v2/../libs/train_with_ft.py", line 181 in <module>
./run.sh: line 34: 15480 Segmentation fault      (core dumped) train_with_ft.py

I'm struggling to solve this problem. Please help me.

Could you run your script via:

$ gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) backtrace
...

and post the backtrace here, please?

Here is the gdb backtrace of my script.

[Thread 0x7f6e9dfff700 (LWP 2457) exited]
[Thread 0x7f6e9d7fe700 (LWP 2458) exited]
[Thread 0x7f6f167fc700 (LWP 2452) exited]
[Thread 0x7f6f1cff9700 (LWP 2450) exited]

Thread 76 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f6fa5fff700 (LWP 25182)]
0x00007f7044916d0b in std::pair<std::__detail::_Node_iterator<c10::Stream, true, false>, bool> std::_Hashtable<c10::Stream, c10::Stream, std::allocator<c10::Stream>, std::__detail::_Identity, std::equal_to<c10::Stream>, std::hash<c10::Stream>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_emplace<c10::Stream const&>(std::integral_constant<bool, true>, c10::Stream const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
(gdb) backtrace
#0  0x00007f7044916d0b in std::pair<std::__detail::_Node_iterator<c10::Stream, true, false>, bool> std::_Hashtable<c10::Stream, c10::Stream, std::allocator<c10::Stream>, std::__detail::_Identity, std::equal_to<c10::Stream>, std::hash<c10::Stream>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >::_M_emplace<c10::Stream const&>(std::integral_constant<bool, true>, c10::Stream const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#1  0x00007f704491074c in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#2  0x00007f7044912082 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#3  0x00007f704490b979 in torch::autograd::Engine::thread_init(int) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so
#4  0x00007f708ba9b08a in torch::autograd::python::PythonEngine::thread_init(int) () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so
#5  0x00007f708c6b6def in execute_native_thread_routine () from /usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#6  0x00007f70af1bb6db in start_thread (arg=0x7f6fa5fff700) at pthread_create.c:463
#7  0x00007f70af4f488f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In the model I also compute the mean and variance of the GRU layer's output over the time dimension and pass them to a Linear layer. Could this be what causes the problem?
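
Roughly, that part of the model looks like this (the layer sizes and variable names are placeholders, not the actual code):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=40, hidden_size=128, batch_first=True).cuda()
fc = nn.Linear(2 * 128, 10).cuda()

x = torch.randn(8, 200, 40).cuda()      # (batch, time, features)
out, _ = gru(x)                         # (batch, time, hidden)
mean = out.mean(dim=1)                  # statistics over the time dimension
var = out.var(dim=1)
y = fc(torch.cat([mean, var], dim=1))   # pooled statistics go into the Linear layer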

Hi,

Yes, this was an issue in 1.4 with improper handling of CUDA streams in the autograd engine. It is fixed in 1.5.
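
After upgrading, you can confirm which version is installed with:

import torch
print(torch.__version__)  # should print 1.5.0 or later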


Hi,

Thank you for your help. I upgraded to 1.5.0, and it seems to work.