Segmentation fault (core dumped) while training

Hi,
When I train a model with PyTorch, it sometimes crashes after a few hundred iterations with "Segmentation fault (core dumped)". No other error information is printed, and I then have to kill the Python processes manually to release the GPU memory.
I ran the program with gdb python and got:

[Thread 0x7fffd5e47700 (LWP 16952) exited]
[Thread 0x7fffd3646700 (LWP 16951) exited]
[Thread 0x7fffd8648700 (LWP 16953) exited]
[Thread 0x7fffd0e45700 (LWP 16954) exited]

Thread 98 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfe4b700 (LWP 15961)]
torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0, 
    var=...) at torch/csrc/autograd/input_buffer.cpp:17
17      torch/csrc/autograd/input_buffer.cpp: No such file or directory.
(gdb) where
#0  torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0, 
    var=...) at torch/csrc/autograd/input_buffer.cpp:17
#1  0x00007fff8f616aad in torch::autograd::Engine::evaluate_function (
    this=this@entry=0x7fff908e4b40 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:268
#2  0x00007fff8f61804e in torch::autograd::Engine::thread_main (
    this=0x7fff908e4b40 <engine>, graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3  0x00007fff8f614e82 in torch::autograd::Engine::thread_init (
    this=this@entry=0x7fff908e4b40 <engine>, device=device@entry=2)
    at torch/csrc/autograd/engine.cpp:121
#4  0x00007fff8f63793a in torch::autograd::python::PythonEngine::thread_init (
    this=0x7fff908e4b40 <engine>, device=2) at torch/csrc/autograd/python_engine.cpp:28
#5  0x00007fff736f2c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bc16ba in start_thread (arg=0x7fffdfe4b700) at pthread_create.c:333
#7  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The program runs correctly on another computer, so the crash seems to be related to the PyTorch installation or the environment. I do not know how to fix it. Can anyone help?

Thanks~

What version of PyTorch are you using?

import torch
print(torch.__version__)

Hi @colesbury, I built and installed PyTorch from source (master branch). The version is 0.2.0+d8ad5de.
However, on another server with PyTorch 0.2.0_3, my program runs correctly. Could the PyTorch version be the cause of the crash?

We fixed a few bugs in the last few days that could be related. (I think that version is from 2 weeks ago). Can you try building from the latest master and see if you still get the segmentation fault?

If it’s still a problem, a script that reproduces the crash would be really helpful.

Got it, thanks @colesbury. I will reinstall the latest PyTorch and give it a try.

Hi @colesbury, with the latest PyTorch master branch, training still crashes after hundreds of iterations. However, if I roll back to the 0.2.0 release (commit id 0b92e5c), the crash never occurs. So this bug seems to be related to the PyTorch version.

Since my project contains a lot of code, I will write a simple example that reproduces the crash and post it here.

Hi,
I encountered the same issue weeks ago but have not been able to solve it yet.
Here is the link to a pytorch discussion I opened: https://discuss.pytorch.org/t/segmentation-fault-in-input-buffer-cpp

Thanks @rotabulo. So far I have found that if I use torch.rand() to generate the input data and labels for the CNN, the segfault does not occur (even with multiple GPUs). However, if I load the input data and labels with a DataLoader, it does. So the segfault seems to be related to the DataLoader. (Images are loaded with OpenCV.)

The segfault does not occur with the 0.2.0 release, only with the latest master. I also found that if I use Python 3 instead of Python 2, the segfault disappears, so I suspect it is related to the multi-worker DataLoader.
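
For reference, here is a minimal sketch of the setup I am describing (the file names, labels, and tensor shapes are placeholders, not my actual training code):

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

# Placeholder file list and labels; my real dataset is much larger.
paths = ['img_%03d.jpg' % i for i in range(128)]
labels = [i % 10 for i in range(128)]

class CvImageDataset(Dataset):
    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Images are decoded with OpenCV inside the DataLoader worker.
        img = cv2.imread(self.paths[idx])  # HWC, BGR, uint8
        img = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
        return img, self.labels[idx]

# Crashing path: images loaded through a multi-worker DataLoader.
loader = DataLoader(CvImageDataset(paths, labels), batch_size=32,
                    shuffle=True, num_workers=4)

# Non-crashing control: synthetic inputs from torch.rand(), no DataLoader.
fake_input = torch.rand(32, 3, 224, 224)
fake_label = torch.LongTensor(32).random_(0, 10)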

I hope this information is helpful.

This is a serious problem, which I currently trace to multi-GPU DataParallel.
DataParallel also brings a decrease in model accuracy; it is a huge performance gap.
I hope @smth @adam can get a fix soon.

OpenCV is known not to play nicely with multiprocessing in Python. Could you try reducing num_workers to 0?
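
For example, something like this (dataset here stands for whatever Dataset instance you are already constructing in your code):

from torch.utils.data import DataLoader

# num_workers=0 loads everything in the main process, so OpenCV never runs
# inside a forked worker process.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)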

Could you provide more details on the performance degradation you are seeing, please? Thanks.

There is as much as a 30% difference in training accuracy for visual question answering. I got the core dump before I even reached the validation epoch…

Do you mean a 30% decrease in accuracy? That is weird. Do you have a script to reproduce the issue?

A 30% decrease, on the training set.

Hello!

I think I am experiencing this issue as well. I only train on my CPU and use the DataLoader class. I get core dumps if I set the num_workers parameter to at least 1; without workers (the default num_workers=0) I have found no problems so far.
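
Roughly, the pattern that crashes for me looks like the sketch below (purely synthetic data for illustration, not my actual code, and I have not verified that this exact snippet crashes on its own):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Small synthetic dataset; in my runs the crash only appears once
# num_workers is set to 1 or more.
dataset = TensorDataset(torch.rand(1000, 10), torch.rand(1000, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=1)

for inputs, targets in loader:
    pass  # iterating is enough to show the pattern; training code omitted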

My PyTorch version is 0.2.0_4.

Hi @TStepi, do you mind sharing a snippet that can reproduce the issue? I can’t repro it.

Is the issue you are seeing a DataParallel issue or a DataLoader issue? Do you also see segfaults?

If you have a script, I can try to debug. From what I’ve seen, it’s more likely caused by OpenCV. But it’s weird that it only happens on master.

I think it is a DataParallel issue.
Whatever number of multiprocessing workers I set on the DataLoader, it still dumps.
After I turn off multi-GPU, it is okay.
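
Concretely, the only change between the crashing run and the working run is whether the model is wrapped in DataParallel, roughly like this (the flag and the placeholder model are just for illustration, not my real network):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the real network
use_multi_gpu = False     # True in the crashing run

if use_multi_gpu and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()  # splits each batch across the GPUs
else:
    model = model.cuda()                   # single GPU: no crash observed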

So it’s a completely different issue from the one in this post, then. If you can provide a script, we can look into it.