Encountering segmentation fault after several iterations


(Linghai Liu) #1

I implemented a Gaussian noise layer as follows:

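import torch
import torch.nn as nn
from torch.autograd import Variable


class GaussianNoise(nn.Module):
    def __init__(self, stddev):
        super(GaussianNoise, self).__init__()
        self.stddev = stddev

    def forward(self, x):
        # Noise is only added in training mode
        if self.training:
            # Sample the noise on the CPU ...
            noise = torch.Tensor(x.size()).normal_(0, self.stddev)
            # ... then move it to the GPU if the input lives there
            if x.is_cuda:
                noise = noise.cuda()
            noise_var = Variable(noise)

            return x + noise_var

        return x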

A neural network that includes this layer starts training normally. However, after about 20 iterations, the training script quits with a segmentation fault. I tried the same network without this layer and the issue did not occur. I got this trace from gdb:

#0  0x00007fffe4067ac7 in THRandom_random () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#1  0x00007fffe4067c36 in THRandom_normal () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#2  0x00007fffe3d84330 in THFloatTensor_normal () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#3  0x00007ffff03d3c83 in THPFloatTensor_normal_ (self=0x7fffa1c95758, args=<optimized out>, kwargs=<optimized out>) at /home/liulhai/pytorch/torch/csrc/generic/TensorMethods.cpp:57210
#4  0x00007ffff7adf615 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#5  0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#6  0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#7  0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#8  0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#9  0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#10 0x00007ffff7a69fda in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#11 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#12 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#13 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#14 0x00007ffff7a9e574 in slot_tp_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#15 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#16 0x00007ffff7ad953b in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#17 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#18 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#19 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#20 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#21 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#22 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#23 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#24 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#25 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#26 0x00007ffff7a9e574 in slot_tp_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#27 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#28 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#29 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#30 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#31 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#32 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#33 0x00007ffff7adfdac in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#34 0x00007ffff7adfdac in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#35 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#36 0x00007ffff7a69fda in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#37 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#38 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#39 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#40 0x00007ffff7ad76d8 in PyEval_CallObjectWithKeywords () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#41 0x00007ffff7b10d46 in t_bootstrap () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#42 0x00007ffff77e7184 in start_thread (arg=0x7fff92fbd700) at pthread_create.c:312
#43 0x00007ffff6e0737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

But I could not decipher the lines above. Help, please!


(Alban D) #2

Hi,

Could you give a small code snippet, or the size of x and the value of stddev, so that I can try to reproduce this locally?


(Linghai Liu) #3

Hi albanD,

Sorry for my late response. I may have found the answer myself. I was training on multiple GPUs. The Gaussian noise layer defined above creates tensors and moves them to a GPU at run time, so the created tensors may not be on the same GPU as the tensors the noise is added to. When I trained on a single GPU, the segmentation fault did not occur.

So the problem reduces to this: how do I make sure the created tensors are moved to the proper GPU?
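
One way to keep the noise on the same device as the input is to let PyTorch allocate it from the input itself. The following is only a minimal sketch, assuming Python 3 and a PyTorch version that provides torch.randn_like (this is not the version used in the trace above); the shapes and the stddev value are made up for illustration. It addresses device placement only and is not a confirmed fix for the segmentation fault.

```python
import torch
import torch.nn as nn


class GaussianNoise(nn.Module):
    """Adds zero-mean Gaussian noise to the input during training only."""

    def __init__(self, stddev):
        super().__init__()
        self.stddev = stddev

    def forward(self, x):
        if self.training:
            # randn_like allocates the noise with the same shape, dtype and
            # device as x, so no separate .cuda() call is needed.
            return x + torch.randn_like(x) * self.stddev
        return x


# Illustrative values only (assumed, not from the original post).
layer = GaussianNoise(stddev=0.1)
x = torch.zeros(4, 3)

layer.train()
noisy = layer(x)   # noise is added in training mode

layer.eval()
clean = layer(x)   # input is returned unchanged in eval mode
```

Because the noise is derived from x, each replica in a multi-GPU run samples on whichever GPU its own input lives on.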


(Alban D) #4

Even in that case, it should not segfault but return a proper error message.
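Also, the error occurs in CPU code, not in CUDA code: the top frames of the trace (THRandom_random, THRandom_normal, THFloatTensor_normal) are all in libTH.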