Encountering segmentation fault after several iterations

I implemented a Gaussian noise layer as follows:

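import torch
import torch.nn as nn
from torch.autograd import Variable


class GaussianNoise(nn.Module):
    def __init__(self, stddev):
        super(GaussianNoise, self).__init__()
        self.stddev = stddev

    def forward(self, x):
        if self.training:
            # The noise tensor is allocated and sampled on the CPU.
            noise = torch.Tensor(x.size()).normal_(0, self.stddev)
            if x.is_cuda:
                # Copy the noise to the GPU (the current CUDA device).
                noise = noise.cuda()
            noise_var = Variable(noise)

            return x + noise_var

        return x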

A neural network that includes this layer starts training normally. However, after about 20 iterations, the training script quit with a segmentation fault. I tried the same network without this layer and the issue did not occur. I got this trace from gdb:

#0  0x00007fffe4067ac7 in THRandom_random () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#1  0x00007fffe4067c36 in THRandom_normal () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#2  0x00007fffe3d84330 in THFloatTensor_normal () from /home/liulhai/miniconda2/lib/python2.7/site-packages/torch/lib/libTH.so.1
#3  0x00007ffff03d3c83 in THPFloatTensor_normal_ (self=0x7fffa1c95758, args=<optimized out>, kwargs=<optimized out>) at /home/liulhai/pytorch/torch/csrc/generic/TensorMethods.cpp:57210
#4  0x00007ffff7adf615 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#5  0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#6  0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#7  0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#8  0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#9  0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#10 0x00007ffff7a69fda in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#11 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#12 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#13 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#14 0x00007ffff7a9e574 in slot_tp_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#15 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#16 0x00007ffff7ad953b in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#17 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#18 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#19 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#20 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#21 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#22 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#23 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#24 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#25 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#26 0x00007ffff7a9e574 in slot_tp_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#27 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#28 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#29 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#30 0x00007ffff7a6a0c7 in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#31 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#32 0x00007ffff7ada4d0 in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#33 0x00007ffff7adfdac in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#34 0x00007ffff7adfdac in PyEval_EvalFrameEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#35 0x00007ffff7ae14e9 in PyEval_EvalCodeEx () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#36 0x00007ffff7a69fda in function_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#37 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#38 0x00007ffff7a5450d in instancemethod_call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#39 0x00007ffff7a45773 in PyObject_Call () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#40 0x00007ffff7ad76d8 in PyEval_CallObjectWithKeywords () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#41 0x00007ffff7b10d46 in t_bootstrap () from /home/liulhai/miniconda2/bin/../lib/libpython2.7.so.1.0
#42 0x00007ffff77e7184 in start_thread (arg=0x7fff92fbd700) at pthread_create.c:312
#43 0x00007ffff6e0737d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

But I could not decipher the lines above. Help, please!


Hi,

Could you give a small code snippet, or the size of x and the value of stddev, so that I can try to reproduce this locally?

Hi albanD,

Sorry for my late response. I may have found the answer myself. I was using multiple GPUs for training. The Gaussian noise layer defined above creates tensors and moves them to a GPU at run time. The created tensors may not be on the same GPU as the tensors the noise is added to. When I used a single GPU for training, the segmentation fault did not occur.

So the problem reduces to this: how do I make sure the created tensors are moved to the proper GPU?

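Even in that case, it should not segfault; it should raise a proper error message.
Also, the error occurs in CPU code, not in CUDA code: the top frames of the backtrace (THRandom_random, THRandom_normal, THFloatTensor_normal) are all in libTH, the CPU library.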
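
On the question of keeping the noise on the same GPU as the input, here is a minimal sketch. It assumes a PyTorch release newer than the one used in this thread, where torch.randn_like is available and Variable is no longer needed. The noise is generated directly from x, so it takes x's shape, dtype and device, and no explicit .cuda() call is required. This is only one possible approach and has not been checked against the multi-GPU setup described above.

```python
import torch
import torch.nn as nn


class GaussianNoise(nn.Module):
    """Adds zero-mean Gaussian noise to the input during training only."""

    def __init__(self, stddev):
        super().__init__()
        self.stddev = stddev

    def forward(self, x):
        if self.training:
            # randn_like allocates the noise with x's shape, dtype and device,
            # so the noise always lives on the same device as x.
            noise = torch.randn_like(x) * self.stddev
            return x + noise
        return x


# Small CPU example; the same code applies unchanged to a CUDA tensor.
layer = GaussianNoise(stddev=0.1)
x = torch.zeros(4, 3)

layer.train()
y_train = layer(x)  # noisy copy of x

layer.eval()
y_eval = layer(x)   # x returned unchanged
```

Because the noise is sampled on x's own device, a GPU input also no longer goes through the CPU-side normal_ call that appears in the backtrace.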