Hi,
When I train a model with PyTorch, it sometimes crashes after hundreds of iterations with "segmentation fault (core dumped)". No other error information is printed. I then have to kill the Python processes manually to release the GPU memory.
I ran the program under gdb python and got:
[Thread 0x7fffd5e47700 (LWP 16952) exited]
[Thread 0x7fffd3646700 (LWP 16951) exited]
[Thread 0x7fffd8648700 (LWP 16953) exited]
[Thread 0x7fffd0e45700 (LWP 16954) exited]
Thread 98 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfe4b700 (LWP 15961)]
torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0,
var=...) at torch/csrc/autograd/input_buffer.cpp:17
17 torch/csrc/autograd/input_buffer.cpp: No such file or directory.
(gdb) where
#0 torch::autograd::InputBuffer::add (this=this@entry=0x7fffdfe4ab10, pos=pos@entry=0,
var=...) at torch/csrc/autograd/input_buffer.cpp:17
#1 0x00007fff8f616aad in torch::autograd::Engine::evaluate_function (
this=this@entry=0x7fff908e4b40 <engine>, task=...)
at torch/csrc/autograd/engine.cpp:268
#2 0x00007fff8f61804e in torch::autograd::Engine::thread_main (
this=0x7fff908e4b40 <engine>, graph_task=0x0) at torch/csrc/autograd/engine.cpp:144
#3 0x00007fff8f614e82 in torch::autograd::Engine::thread_init (
this=this@entry=0x7fff908e4b40 <engine>, device=device@entry=2)
at torch/csrc/autograd/engine.cpp:121
#4 0x00007fff8f63793a in torch::autograd::python::PythonEngine::thread_init (
this=0x7fff908e4b40 <engine>, device=2) at torch/csrc/autograd/python_engine.cpp:28
#5 0x00007fff736f2c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff7bc16ba in start_thread (arg=0x7fffdfe4b700) at pthread_create.c:333
#7 0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
The program runs correctly on another computer, so the crash seems to be related to the PyTorch installation or the environment. I do not know how to fix it. Can anyone give some help?
Hi @colesbury, I built and installed PyTorch from source (master branch). The version is 0.2.0+d8ad5de.
However, on another server with PyTorch 0.2.0_3, my program runs correctly. Is it possible that the PyTorch version is the reason for the crash?
We fixed a few bugs in the last few days that could be related. (I think that version is from 2 weeks ago). Can you try building from the latest master and see if you still get the segmentation fault?
If it’s still a problem, a script that reproduces the crash would be really helpful.
Hi @colesbury, with the latest PyTorch master branch, the training still crashes after hundreds of iterations. However, if I roll back to the 0.2.0 release version (commit id 0b92e5c), the crash never occurs. It seems that this bug is related to the PyTorch version.
Since my project contains a lot of code, I will write a simple example that reproduces this crash and share it with you.
Thanks @rotabulo. So far I have found that if I use torch.rand() to generate the input data and labels for the CNN, this segmentation fault does not occur (even with multiple GPUs). However, if I load the input data and labels with a DataLoader, the bug does occur, so the seg fault seems to be related to the DataLoader. (Images are loaded with OpenCV.)
The seg fault does not occur with the 0.2.0 release version; it only occurs with the latest master version. I also find that if I use Python 3 instead of Python 2, the seg fault disappears, so I guess it is related to the multi-process DataLoader workers.
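Here is roughly the shape of the script I am using. This is only a minimal sketch, not my real project code: the ./data/*.jpg pattern, the toy CNN, and the hyperparameters are placeholders, and it is written against a recent PyTorch API for readability.

import glob
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class OpenCVImageDataset(Dataset):
    # Loads images with OpenCV and returns (image_tensor, dummy_label).
    def __init__(self, pattern='./data/*.jpg'):
        self.paths = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.paths[idx])                            # HWC, BGR, uint8
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        img = torch.from_numpy(img).permute(2, 0, 1)                 # CHW float tensor
        return img, idx % 10                                         # dummy label

class ToyCNN(nn.Module):
    # Stand-in for the real network.
    def __init__(self):
        super(ToyCNN, self).__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        x = F.relu(self.conv(x))
        x = F.adaptive_avg_pool2d(x, 1).view(x.size(0), -1)
        return self.fc(x)

model = nn.DataParallel(ToyCNN()).cuda()                             # multi-GPU wrapper
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

use_random_inputs = False  # True (torch.rand inputs): no crash; False (DataLoader): segfault
loader = DataLoader(OpenCVImageDataset(), batch_size=32, shuffle=True, num_workers=4)

for epoch in range(100):
    for images, labels in loader:
        if use_random_inputs:
            # Replace the loaded batch with synthetic data of the same shape.
            images = torch.rand(images.size(0), 3, 224, 224)
            labels = torch.randint(0, 10, (images.size(0),))
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()

Flipping use_random_inputs is the only change between the stable and the crashing runs for me.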
This is a serious problem, which I have so far traced to multi-GPU DataParallel.
DataParallel also brings a decrease in model accuracy; it is a huge performance gap.
I hope @smth @adam can get a fix out soon.
I think I am experiencing this issue as well. I only train on my CPU and use the DataLoader class. I get core dumps if I set the num_workers parameter to at least 1; with num_workers=0 I have had no problems so far.
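For what it is worth, this is a minimal sketch of what I mean; the TensorDataset is just a stand-in for my real dataset.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; my real one is larger but the behaviour is the same.
dataset = TensorDataset(torch.rand(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# num_workers >= 1 spawns worker processes and is where I see the core dump;
# num_workers=0 loads batches in the main process and has been stable so far.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for images, labels in loader:
    pass  # the training step would go here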
I think it is a DataParallel issue.
Whatever number of data-loading workers I set for the DataLoader, it still dumps core.
After I turn off multi-GPU, it is okay.
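Concretely, this is the kind of switch I mean; model here is just a placeholder nn.Module, and only the wrapping changes.

import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # placeholder for the real network

use_multi_gpu = False  # True (nn.DataParallel over all GPUs) segfaults for me
if use_multi_gpu and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()  # multi-GPU: crashes after some iterations
else:
    model = model.cuda()                   # single-GPU path: no segfault so far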