Initialization and batch normalization

Hello Moderators,

In practice, do we use weight initialization techniques such as Glorot or He et al. together with batch normalization, or is it a case of choosing one or the other?



The reference ResNet implementation uses the initialisation strategy from He et al., and that network also contains batch norm layers, so in practice the two are used together.
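
For reference, here is a minimal sketch of what He et al. initialisation amounts to: weights drawn from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in). It is plain Python rather than any framework's API, and the helper name `he_normal` and the layer sizes are made up for illustration.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# Hypothetical layer sizes, chosen only for this example.
rng = random.Random(0)
weights = he_normal(fan_in=256, fan_out=128, rng=rng)

# Compare the empirical spread of the sampled weights with the target value.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
target_std = math.sqrt(2.0 / 256)
print(f"empirical std {empirical_std:.4f}, target std {target_std:.4f}")
```

A framework would normally do this for you through its own initializer; the sketch only shows the scaling rule.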


Sorry for reviving this old thread. Could you please explain why initialization is still necessary when we use batch normalization? Batch normalization normalizes over the mini-batch and reduces the problem of the “mean length scale in the final layer” described in “How to Start Training: The Effect of Initialization and Architecture” by Hanin and Rolnick. Intuitively, batch norm should also bring the mean to zero and the standard deviation to one before the layer output is fed to later layers. That should prevent the kind of slowdown due to zigzagging described in “Efficient BackProp” by LeCun et al.

There are still shift and scale parameters applied to the normalized mini-batch input. These parameters are learned in the batch normalization layer.
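
To make that concrete, here is a rough sketch of the batch norm forward pass for a single feature, written in plain Python rather than with any framework. The names `gamma` (scale), `beta` (shift) and `eps`, and the example values, are assumptions for illustration; in a real layer `gamma` and `beta` are learned during training.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then apply scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def mean_std(values):
    """Return the mean and (biased) standard deviation of a list."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

batch = [2.0, 4.0, 6.0, 8.0]

# With gamma=1 and beta=0 the output has mean ~0 and std ~1.
plain = batch_norm(batch)

# With other values of gamma and beta the output no longer has
# zero mean and unit standard deviation: here mean ~0.5 and std ~3.
shifted = batch_norm(batch, gamma=3.0, beta=0.5)

print(mean_std(plain))
print(mean_std(shifted))
```

So the statistics passed to the next layer depend on whatever `gamma` and `beta` end up being, not only on the normalization step.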

OK, but batch norm has a flag that can be set to false to disable the shift and scale parameters. What happens if we use that?