Debugging a segmentation fault

Dear all,

When I train my model, I sometimes hit a segmentation fault, with no other information to help me debug it. The crash is quite random: for example, it may only happen after a few epochs (so it is unlikely to be caused by the dataloader).

My setup is PyTorch 1.1.0, an NVIDIA TITAN Xp GPU, Ubuntu 16.04.3 LTS, and CUDA 9.0.

I used gdb to record the crash and got the following backtrace:

#0 0x00007fffefbe42e1 in std::_Hashtable<std::type_index, std::pair<std::type_index const, pybind11::detail::type_info*>, std::allocator<std::pair<std::type_index const, pybind11::detail::type_info*> >, std::__detail::_Select1st, std::equal_to<std::type_index>, std::hash<std::type_index>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find(std::type_index const&) ()
from /home/wangkx/anaconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so
#1 0x00007fffefbe8a98 in pybind11::detail::get_type_info(std::type_index const&, bool) () from /home/wangkx/anaconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so
#2 0x00007fffefc73ec7 in pybind11::detail::type_caster_generic::src_and_type(void const*, std::type_info const&, std::type_info const*) ()
from /home/wangkx/anaconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so
#3 0x00007fffeff07243 in void pybind11::cpp_function::initialize<torch::jit::tracer::initPythonTracerBindings(_object*)::{lambda()#11}, std::shared_ptr<torch::jit::tracer::TracingState>, , pybind11::name, pybind11::scope, pybind11::sibling>(torch::jit::tracer::initPythonTracerBindings(_object*)::{lambda()#11}&&, std::shared_ptr<torch::jit::tracer::TracingState> (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) ()
from /home/wangkx/anaconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so
#4 0x00007fffefbec0ea in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /home/wangkx/anaconda2/lib/python2.7/site-packages/torch/lib/libtorch_python.so
#5 0x00007ffff7adc615 in PyEval_EvalFrameEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#6 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#7 0x00007ffff7a66fda in function_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#8 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#9 0x00007ffff7a5150d in instancemethod_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#10 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#11 0x00007ffff7a9b574 in slot_tp_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#12 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#13 0x00007ffff7ad653b in PyEval_EvalFrameEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#14 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#15 0x00007ffff7a670c7 in function_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#16 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#17 0x00007ffff7ad74d0 in PyEval_EvalFrameEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#18 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#19 0x00007ffff7a66fda in function_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#20 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#21 0x00007ffff7a5150d in instancemethod_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#22 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#23 0x00007ffff7a9b574 in slot_tp_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#24 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#25 0x00007ffff7ad653b in PyEval_EvalFrameEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#26 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#27 0x00007ffff7a670c7 in function_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#28 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#29 0x00007ffff7ad74d0 in PyEval_EvalFrameEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#30 0x00007ffff7ade4e9 in PyEval_EvalCodeEx () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#31 0x00007ffff7a66fda in function_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#32 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#33 0x00007ffff7a5150d in instancemethod_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#34 0x00007ffff7a42773 in PyObject_Call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0
#35 0x00007ffff7a9b574 in slot_tp_call () from /home/wangkx/anaconda2/bin/…/lib/libpython2.7.so.1.0

I really do not know how to debug this any further. Please help if you have any ideas.
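
The backtrace above only shows C-level frames, so it does not say which line of the training script was running. One thing that might narrow it down is the `faulthandler` module, which dumps the Python-level stack when the process receives a fatal signal such as SIGSEGV. This is only a minimal sketch under some assumptions: `faulthandler` is part of the standard library from Python 3.3 on, while on Python 2.7 it has to be installed as a separate backport package. The helper name `enable_crash_traceback` and its `log_path` argument are made up for illustration.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Ask faulthandler to dump the Python stack on fatal signals (e.g. SIGSEGV).

    log_path is a hypothetical argument: when given, the dump is written to
    that file instead of stderr. The returned file object has to stay open
    for as long as the handler is enabled.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    # all_threads=True also dumps the stacks of threads other than the crashing one.
    faulthandler.enable(file=stream, all_threads=True)
    return stream


# Hypothetical usage at the very top of the training script:
#     crash_log = enable_crash_traceback("segfault_traceback.log")
#     ... build the model and run the training loop ...
```

If the crash happens again, the log should contain the Python call stack at the moment of the fault, which can then be matched against frames #5 and up in the gdb output.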
