Segmentation fault (core dumped) while training

Hi,
When I train a model with PyTorch, it sometimes crashes after a few hundred iterations with "Segmentation fault (core dumped)". No other error information is printed, and I then have to kill the Python processes manually to release the GPU memory.
I ran the program with gdb python and got:

[Thread 0x7fffd5e47700 (LWP 16952) exited]
[Thread 0x7fffd3646700 (LWP 16951) exited]
[Thread 0x7fffd8648700 (LWP 16953) exited]
[Thread 0x7fffd0e45700 (LWP 16954) exited]

Thread 98 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfe4b700 (LWP 15961)]
torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0, 
    var=...) at torch/csrc/autograd/input_buffer.cpp:17
17      torch/csrc/autograd/input_buffer.cpp: No such file or directory.
(gdb) where
#0  torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0, 
    var=...) at torch/csrc/autograd/input_buffer.cpp:17
#1  0x00007fff8f616aad in torch::autograd::Engine::evaluate_function (
    this=this@entry=0x7fff908e4b40 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:268
#2  0x00007fff8f61804e in torch::autograd::Engine::thread_main (
    this=0x7fff908e4b40 <engine>, graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3  0x00007fff8f614e82 in torch::autograd::Engine::thread_init (
    this=this@entry=0x7fff908e4b40 <engine>, device=device@entry=2)
    at torch/csrc/autograd/engine.cpp:121
#4  0x00007fff8f63793a in torch::autograd::python::PythonEngine::thread_init (
    this=0x7fff908e4b40 <engine>, device=2) at torch/csrc/autograd/python_engine.cpp:28
#5  0x00007fff736f2c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7bc16ba in start_thread (arg=0x7fffdfe4b700) at pthread_create.c:333
#7  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The program runs correctly on another computer, so the crash seems to be related to the PyTorch installation or the environment. I do not know how to fix it. Can anyone help?

Thanks~

What version of PyTorch are you using?

import torch
print(torch.__version__)

Hi @colesbury, I built and installed PyTorch from source (master branch). The version is 0.2.0+d8ad5de.
However, on another server with PyTorch 0.2.0_3, my program runs correctly. Could the PyTorch version be the cause of the crash?

We fixed a few bugs in the last few days that could be related. (I think that version is from 2 weeks ago). Can you try building from the latest master and see if you still get the segmentation fault?

If it’s still a problem, a script that reproduces the crash would be really helpful.

Got it, thanks @colesbury. I will reinstall the latest PyTorch and give it a try.

Hi @colesbury, with the latest PyTorch master branch, training still crashes after hundreds of iterations. However, if I roll back to the 0.2.0 release (commit id 0b92e5c), the crash never occurs. So this bug seems to be related to the PyTorch version.

Since my project contains a lot of code, I will write a simple example that reproduces the crash and post it here.

Hi,
I encountered the same issue weeks ago but have not been able to solve it yet.
Here is the link to a pytorch discussion I opened: https://discuss.pytorch.org/t/segmentation-fault-in-input-buffer-cpp

Thanks @rotabulo. So far I have found that if I use torch.rand() to generate the input data and labels for the CNN, the segfault does not occur (even with multiple GPUs). However, if I load the input data and labels with a DataLoader, it does. So the segfault seems to be related to the DataLoader. (Images are loaded with OpenCV.)

The segfault does not occur with the 0.2.0 release, only with the latest master. I also found that if I use Python 3 instead of Python 2, the segfault disappears, so I suspect it is related to the multi-worker DataLoader.
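
For reference, here is a minimal sketch of the setup I am describing (the file names, labels, and tensor shapes are placeholders, not my actual training code):

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

# Placeholder file list and labels; my real dataset is much larger.
paths = ['img_%03d.jpg' % i for i in range(128)]
labels = [i % 10 for i in range(128)]

class CvImageDataset(Dataset):
    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Images are decoded with OpenCV inside the DataLoader worker.
        img = cv2.imread(self.paths[idx])  # HWC, BGR, uint8
        img = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
        return img, self.labels[idx]

# Crashing path: images loaded through a multi-worker DataLoader.
loader = DataLoader(CvImageDataset(paths, labels), batch_size=32,
                    shuffle=True, num_workers=4)

# Non-crashing control: synthetic inputs from torch.rand(), no DataLoader.
fake_input = torch.rand(32, 3, 224, 224)
fake_label = torch.LongTensor(32).random_(0, 10)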

I hope this information is helpful.

This is a serious problem, which I currently trace to multi-GPU DataParallel.
DataParallel also brings a decrease in model accuracy; it is a huge performance gap.
I hope @smth @adam can get a fix soon.

OpenCV is known not to play nicely with multiprocessing in Python. Could you try reducing num_workers to 0?
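
For example, something like this (dataset here stands for whatever Dataset instance you are already constructing in your code):

from torch.utils.data import DataLoader

# num_workers=0 loads everything in the main process, so OpenCV never runs
# inside a forked worker process.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)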

Could you provide more details on the performance degradation you are seeing, please? Thanks.

There is as much as a 30% difference in training accuracy for visual question answering. I got the core dump before I even reached the validation epoch…

Do you mean a 30% decrease in accuracy? That is weird. Do you have a script to reproduce the issue?

A 30% decrease, on the training set.

Hello!

I think I am experiencing this issue as well. I only train on my CPU and use the DataLoader class. I get core dumps if I set the num_workers parameter to at least 1; without workers (the default num_workers=0) I have found no problems so far.
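
Roughly, the pattern that crashes for me looks like the sketch below (purely synthetic data for illustration, not my actual code, and I have not verified that this exact snippet crashes on its own):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Small synthetic dataset; in my runs the crash only appears once
# num_workers is set to 1 or more.
dataset = TensorDataset(torch.rand(1000, 10), torch.rand(1000, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=1)

for inputs, targets in loader:
    pass  # iterating is enough to show the pattern; training code omitted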

My PyTorch version is 0.2.0_4.

Hi @TStepi, do you mind sharing a snippet that can reproduce the issue? I can’t repro it.

Is the issue you are seeing a DataParallel issue or a DataLoader issue? Do you also see segfaults?

If you have a script, I can try to debug. From what I’ve seen, it’s more likely caused by OpenCV. But it’s weird that it only happens on master.

I think it is a DataParallel issue.
Whatever number of multiprocessing workers I set on the DataLoader, it still dumps.
After I turn off multi-GPU, it is okay.
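
Concretely, the only change between the crashing run and the working run is whether the model is wrapped in DataParallel, roughly like this (the flag and the placeholder model are just for illustration, not my real network):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the real network
use_multi_gpu = False     # True in the crashing run

if use_multi_gpu and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()  # splits each batch across the GPUs
else:
    model = model.cuda()                   # single GPU: no crash observed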

So it’s a completely different issue from the one in this post, then. If you can provide a script, we can look into it.