Hello Moderators,
Do we use weight initialization techniques like Glorot or He et al. together with batch normalization in practice, or is it a case of choosing one or the other?
Regards
Hi,
The reference ResNet implementation uses the initialisation strategy from He et al. and it’s a network with batch norm layers.
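For what it's worth, here is a minimal sketch of how the two are combined, assuming PyTorch (the thread doesn't name a framework) and a placeholder `model`; it mirrors the spirit of the torchvision ResNet reference code rather than quoting it:

```python
import torch.nn as nn

def init_weights(model: nn.Module) -> None:
    """He (Kaiming) init for conv layers plus the usual constant init
    for batch-norm layers, in the spirit of the ResNet reference code."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # He et al. initialization, suited to ReLU non-linearities
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            # batch-norm scale (gamma) starts at 1, shift (beta) at 0
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
```

So the two aren't alternatives: the conv weights still get a principled starting point, and the batch norm layers get their own (trivial) initialization on top.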
Hi,
Sorry for opening up this old thread. Can you please explain why initialization is necessary when we are using batch normalization? Batch normalization normalizes along the batch dimension, which reduces the problem of the mean length scale in the final layer described in "How to Start Training: The Effect of Initialization and Architecture" by Hanin and Rolnick. Also, intuitively, batch norm should reduce the mean to zero and the standard deviation to one before the layer output is fed to later layers, so it should prevent the slowdown due to zigzagging described in "Efficient Backprop" by LeCun.
There are still shift and scale parameters applied to the normalized mini-batch input. These parameters are learned in the batch normalization layer.
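As a small illustration (PyTorch assumed, not from the original post), the layer exposes these as ordinary learnable parameters:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(64)                  # affine=True by default
print(bn.weight.shape, bn.bias.shape)    # torch.Size([64]) torch.Size([64]) -> learned scale (gamma) and shift (beta)
print(bn.weight.requires_grad)           # True: updated by the optimizer like any other parameter
```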
OK, but batch norm actually has a flag that can be set to false to avoid the shift and scale parameters altogether. If we use that, then what?
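(Assuming the flag in question is PyTorch's `affine` argument, that case looks like this:)

```python
import torch.nn as nn

bn = nn.BatchNorm2d(64, affine=False)    # disable the learnable shift/scale
print(bn.weight, bn.bias)                # None None: only the per-batch normalization remains
```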