Implementation of Batch Renormalization fails unexpectedly (Segmentation fault, core dumped)

This is my implementation of Batch Renormalization:

import torch
from torch.autograd import Variable
from torch.nn import Module, Parameter


class BatchReNorm1d(Module):

    def __init__(self, num_features, eps=1e-5, momentum=0.1, rmax=3.0, dmax=5.0, affine=True):
        super(BatchReNorm1d, self).__init__()
        self.num_features = num_features
        self.affine = affine
        self.eps = eps
        self.momentum = momentum
        self.rmax = rmax
        self.dmax = dmax
        if self.affine:
            self.weight = Parameter(torch.Tensor(num_features))
            self.bias = Parameter(torch.Tensor(num_features))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
        self.register_buffer('r', torch.ones(1))
        self.register_buffer('d', torch.zeros(1))
        self.reset_parameters()

    def reset_parameters(self):
        self.running_mean.zero_()
        self.running_var.fill_(1)
        self.r.fill_(1)
        self.d.zero_()
        if self.affine:
            self.weight.data.uniform_()
            self.bias.data.zero_()

    def _check_input_dim(self, input):
        if input.size(1) != self.running_mean.nelement():
            raise ValueError('got {}-feature tensor, expected {}'
                             .format(input.size(1), self.num_features))

    def forward(self, input):
        self._check_input_dim(input)
        if self.training:
            # Per-feature statistics of the current mini-batch
            sample_mean = torch.mean(input, dim=0)
            sample_var = torch.var(input, dim=0)

            # Batch Renormalization correction factors, treated as constants
            # with respect to the graph (hence .data)
            self.r = torch.clamp(sample_var.data / self.running_var,
                                 1. / self.rmax, self.rmax)
            self.d = torch.clamp((sample_mean.data - self.running_mean) / self.running_var,
                                 -self.dmax, self.dmax)

            input_normalized = (input - sample_mean.expand_as(input))/sample_var.expand_as(input)
            input_normalized = input_normalized * Variable(self.r).expand_as(input)
            input_normalized += Variable(self.d).expand_as(input)

            # Update the running statistics with an exponential moving average
            self.running_mean += self.momentum * (sample_mean.data - self.running_mean)
            self.running_var += self.momentum * (sample_var.data - self.running_var)
            
            if self.affine:
                input_normalized = input_normalized * self.weight.expand_as(input)
                input_normalized += self.bias.unsqueeze(0).expand_as(input)
                return input_normalized
            
            else:
                return input_normalized
#         else:
#             input_normalized = (input - self.running_mean.expand_as(input))/self.running_var.expand_as(input)
#             if self.affine:
#                 return input_normalized * self.weight.expand_as(input) + self.bias.expand_as(input)
#             else:
#                 return input_normalized
                  
    def __repr__(self):
        return ('{name}({num_features}, eps={eps}, momentum={momentum},'
                ' affine={affine})'
                .format(name=self.__class__.__name__, **self.__dict__))

When I forward through it using a toy model, my IPython kernel dies without any error. What could be the possible explanation for this behaviour? The strange thing is that the kernel dies at a different point on each run.

When I run it as a Python file, the output is Segmentation fault (core dumped).
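For reference, the kind of toy driver I mean is roughly the following (the feature size, batch size, and random data are placeholders, not my actual script):

# Hypothetical toy driver, only to show how the module is exercised;
# sizes and data are placeholders, not the actual training script.
import torch
from torch.autograd import Variable

layer = BatchReNorm1d(10)
layer.train()

for step in range(100):
    x = Variable(torch.randn(32, 10))  # batch of 32 samples, 10 features each
    out = layer(x)                     # the crash happens at a different step each run
    print(step, out.size())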

Can you try running it in gdb (gdb --args python your_script.py, then type r to run and bt to show the backtrace after the segfault), and paste the result here?

Below is the output from my terminal:

(pytorch) zafar@inspiron:~/Desktop$ gdb --args python batchrenorm.py
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) r
Starting program: /home/zafar/pytorch/bin/python batchrenorm.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3c06700 (LWP 15536)]
[New Thread 0x7ffff3405700 (LWP 15537)]
[New Thread 0x7fffeec04700 (LWP 15538)]
[New Thread 0x7fffec403700 (LWP 15539)]
[New Thread 0x7fffe9c02700 (LWP 15540)]
[New Thread 0x7fffe9401700 (LWP 15541)]
[New Thread 0x7fffe4c00700 (LWP 15542)]
Files already downloaded and verified
Files already downloaded and verified
[Thread 0x7fffe9401700 (LWP 15541) exited]
[Thread 0x7fffe9c02700 (LWP 15540) exited]
[Thread 0x7fffec403700 (LWP 15539) exited]
[Thread 0x7fffeec04700 (LWP 15538) exited]
[Thread 0x7ffff3405700 (LWP 15537) exited]
[Thread 0x7ffff3c06700 (LWP 15536) exited]
[Thread 0x7fffe4c00700 (LWP 15542) exited]
[New Thread 0x7fffe4c00700 (LWP 15548)]
[New Thread 0x7fffe9401700 (LWP 15549)]
[New Thread 0x7fffe9c02700 (LWP 15550)]
[New Thread 0x7fffec403700 (LWP 15551)]
[New Thread 0x7ffff3961980 (LWP 15552)]
[New Thread 0x7ffff3560a00 (LWP 15553)]
[New Thread 0x7ffff315fa80 (LWP 15554)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
malloc_consolidate (av=av@entry=0x7ffff7bb4b20 <main_arena>) at malloc.c:4179
4179	malloc.c: No such file or directory.
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffff7bb4b20 <main_arena>)
    at malloc.c:4179
#1  0x00007ffff78710a8 in _int_free (av=0x7ffff7bb4b20 <main_arena>, 
    p=<optimized out>, have_lock=0) at malloc.c:4073
#2  0x00007ffff787498c in __GI___libc_free (mem=<optimized out>)
    at malloc.c:2966
#3  0x00007fffd8ff6e2e in THFloatStorage_free ()
   from /home/zafar/pytorch/lib/python3.5/site-packages/torch/lib/libTH.so.1
#4  0x00007fffd900f564 in THFloatTensor_free ()
   from /home/zafar/pytorch/lib/python3.5/site-packages/torch/lib/libTH.so.1
#5  0x00007fffdf58261d in THPFloatTensor_dealloc (self=0x7fffaf076a88)
    at /data/users/soumith/builder/wheel/pytorch-src/torch/csrc/generic/Tensor.cpp:70
#6  0x000000000055d9ba in ?? ()
#7  0x00007fffdf8d9309 in THPPointer<_object>::~THPPointer (this=0x3d06220, 
    __in_chrg=<optimized out>)
    at /data/users/soumith/builder/wheel/pytorch-src/torch/csrc/utils/object_ptr.h:12
#8  std::_Head_base<0ul, THPPointer<_object>, false>::~_Head_base (
    this=0x3d06220, __in_chrg=<optimized out>)
    at /data/users/soumith/miniconda2/envs/py35k/gcc/include/c++/tuple:129
#9  std::_Tuple_impl<0ul, THPPointer<_object>, int, std::unique_ptr<torch::autograd::VariableVersion, std::default_delete<torch::autograd::VariableVersion> > >:---Type <return> to continue, or q <return> to quit---q
Quit

I think I have found the culprit. It is this line in the forward call of BatchReNorm1d:

input_normalized = (input - sample_mean.expand_as(input))/sample_var.expand_as(input)

But I don't know why this is a problem. @jekbradbury, please tell me what to do?

When I break this line into two lines:

input_normalized = (input - sample_mean.expand_as(input))
input_normalized = input_normalized/sample_var.expand_as(input)

things are working fine (at least I hope so), which is intriguing.

Whoa, that doesn’t make very much sense. I’ll see if I can run your code today.


@jekbradbury I think I have figured out what was going wrong. The divisor sample_var.expand_as(input) sometimes became zero, which led to the failure: I forgot to give sample_var a lower bound of eps. Now the code is running fine. Thanks a lot! Please have a look at my implementation and tell me how I can improve it.
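Concretely, the change amounts to clamping the batch variance away from zero before dividing. A sketch of what I changed inside forward() (the exact clamp call is just one way to write it):

# Sketch of the fix: give sample_var a lower bound of eps so the
# normalization never divides by a (near-)zero variance.
sample_var = torch.clamp(torch.var(input, dim=0), min=self.eps)

input_normalized = (input - sample_mean.expand_as(input))
input_normalized = input_normalized / sample_var.expand_as(input)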

Is there a version for Batch Renormalization 2d?
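If not, I suppose the 1d module above could be adapted by computing the per-channel statistics over the batch and spatial dimensions. A rough, untested sketch of that reduction for a 4d (N, C, H, W) input, with everything else mirroring BatchReNorm1d:

# Rough, untested sketch: per-channel statistics over batch and spatial dims.
n, c, h, w = input.size()
flat = input.transpose(0, 1).contiguous().view(c, -1)          # shape (C, N*H*W)
sample_mean = torch.mean(flat, dim=1)                           # per-channel mean
sample_var = torch.clamp(torch.var(flat, dim=1), min=self.eps)  # per-channel variance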
