Segmentation fault in input_buffer.cpp

Hi,
I encountered a segmentation fault issue with the latest master version of pytorch.
I managed to catch it with gdb and apparently the line where it happened is the following:

torch::autograd::InputBuffer::add (this=this@entry=0x7dc51afefb10, pos=pos@entry=0,
    var=…) at torch/csrc/autograd/input_buffer.cpp:17
17          if (!item.first.defined()) {

Any idea what could cause it? It looks like an issue with the pytorch internals.
Note that my code used to work perfectly on a previous commit, and the segmentation fault occurs after a random number of epochs/iterations.

Thanks.

Additional backtrace information:

#0 torch::autograd::InputBuffer::add (this=this@entry=0x7dc51afefb10, pos=pos@entry=0,
var=…) at torch/csrc/autograd/input_buffer.cpp:17
#1 0x00007fc6df0cb3ec in torch::autograd::Engine::evaluate_function (
this=this@entry=0x7fc6e03d4040 , task=…)
at torch/csrc/autograd/engine.cpp:268
#2 0x00007fc6df0cc84e in torch::autograd::Engine::thread_main (
this=0x7fc6e03d4040 , graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3 0x00007fc6df0c9632 in torch::autograd::Engine::thread_init (
this=this@entry=0x7fc6e03d4040 , device=device@entry=3)
at torch/csrc/autograd/engine.cpp:121
#4 0x00007fc6df0ec11a in torch::autograd::python::PythonEngine::thread_init (
this=0x7fc6e03d4040 , device=3) at torch/csrc/autograd/python_engine.cpp:28
#5 0x00007fc693ebac80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007fc74773f6ba in start_thread (arg=0x7dc51aff0700) at pthread_create.c:333
#7 0x00007fc7474753dd in clone () at …/sysdeps/unix/sysv/linux/x86_64/clone.S:109

Hi, thanks for posting. Could you post a small code snippet for reproducing the issue?

Hi,
the problem is that I currently have no idea what might be causing it, and it is already difficult to actually catch it in gdb.
If you could give me some information about the role of the piece of code that is triggering the segmentation fault, I could try to investigate the cause further.

That code is the part of the autograd engine that sets the input variables of backward functions. More often than not, the actual cause of a segfault is located somewhere other than the erroring line…

We just found out that two different learning tasks of ours (one segmentation, one classification) give a segfault in exactly the same place. This makes it less likely that the crash is a side effect of memory corruption (e.g. due to bad indexing) coming from other parts of the code, if that is what you mean by “somewhere other than the erroring line”.

It is still entirely possible that the bug is in pytorch. I just mean that it is probably not in this line/function/file. So knowing what this line does likely won’t help. If it’s possible, could you post or send me the code that segfaults?

Ok, we will first try to simplify the code as much as possible while verifying that the segfault persists. Once we get to a reasonably simplified setting, I will share the code with you.

We discovered that it segfaults because next_fn->num_inputs is unexpectedly zero, which leads to the buffer having no elements in https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/input_buffer.cpp#L16. As a consequence, the variable “item” ends up referring to invalid memory, and the segmentation fault is triggered as soon as we access the field “first”. Is this helpful for getting a better idea of the issue? We are currently trying to debug the pytorch internals.
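For reference, here is a minimal sketch of the failure mode as we understand it. FakeInputBuffer and FakeVariable are simplified stand-ins, not the actual PyTorch types; we are assuming the real buffer is a std::vector sized from num_inputs and indexed with operator[] (no bounds check).

// Simplified stand-in illustrating the suspected failure mode; not the real PyTorch code.
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical stand-in for torch::autograd::Variable, just enough to
// mimic the `item.first.defined()` access pattern.
struct FakeVariable {
  const void* impl = nullptr;            // null means "undefined"
  bool defined() const { return impl != nullptr; }
};

struct FakeInputBuffer {
  // One (variable, version) slot per input of the backward function.
  std::vector<std::pair<FakeVariable, int>> buffer;

  explicit FakeInputBuffer(size_t num_inputs) : buffer(num_inputs) {}

  void add(size_t pos, FakeVariable var) {
    // If num_inputs was 0, buffer is empty and buffer[pos] is already
    // out of bounds: `item` is an invalid reference, and touching
    // item.first is undefined behavior (often a segfault).
    auto& item = buffer[pos];
    if (!item.first.defined()) {         // <-- corresponds to the crash site
      item.first = var;
    }
    // (the real code would otherwise accumulate var into item.first)
  }
};

int main() {
  FakeInputBuffer ok(1);
  ok.add(0, FakeVariable{});             // fine: buffer has one slot

  FakeInputBuffer broken(0);             // num_inputs unexpectedly zero
  broken.add(0, FakeVariable{});         // out-of-bounds access: UB, may segfault
  std::puts("may or may not reach this line");
}

Since the out-of-bounds read is undefined behavior, it may crash immediately or only touch unrelated memory depending on the layout, which could explain why the crash shows up after a random number of iterations.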


Thanks for doing this!!

It definitely helps to learn more about the issue, yet it is still unclear to me how that happened. If possible, could you share the relevant graph composition? Ideally one that reproduces the issue, e.g. a subgraph that errors in the same way when backward is called on it.

It seems like I am facing a similar issue (a segfault at exactly the same line), though it only appears to happen when I am using multiple GPUs. In my case, it started happening after I added some code for accumulating gradients instead of updating at every batch.

I am also using pytorch from master.

@rotabulo, may I ask if you also get this when using only one GPU, or no GPUs at all?

@cesarsouza This happens with multiple GPUs, and we have a customized BN layer that we suspect is triggering the segfault in some way. The code used to work on previous versions of pytorch, and training worked fine with our layer.

I have the same problem. It is still not solved.