BatchNorm Initialization

Recently I rebuilt my Caffe model in PyTorch and got much worse performance than the original, and the convergence speed is also slightly slower than before. When I checked the initialization of the model, I noticed that in Caffe's BN (actually the scale layer) the parameter gamma is initialized to 1.0, while the default initialization in PyTorch seems to produce random float numbers. In addition, when I initialize the BN weight to 1.0, the training loss drops faster (see the sketch below for what I do). So I would like to know what the default initialization is in PyTorch's BN layer. Could anyone please answer my question? Thanks a lot!
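
To mimic the Caffe behavior, I do something along these lines (a minimal sketch; init_bn_like_caffe is just an illustrative helper name, not code from my project):

    import torch.nn as nn

    def init_bn_like_caffe(model: nn.Module) -> None:
        # Set gamma (weight) to 1.0 and beta (bias) to 0.0 for every BN layer,
        # matching Caffe's scale-layer defaults.
        for m in model.modules():
            if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                if m.affine:
                    nn.init.constant_(m.weight, 1.0)
                    nn.init.constant_(m.bias, 0.0)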

Regards
MeowLady

It’s this: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py#L33-L39

w ~ U[0, 1]
b = 0

Wow, got it! Thanks a lot for your quick and accurate reply!

I might be misunderstanding something, but I also want to see how the weights and biases of BatchNorm are initialized. Looking at the source code, I see the following in the __init__ for the default parameters:

            self.weight = Parameter(torch.Tensor(num_features))
            self.bias = Parameter(torch.Tensor(num_features))

However, when I run torch.Tensor(num_features) on my computer I get some numbers < 0 and > 1, and even some nan values. Am I doing something wrong or missing something?

Thanks!

You are right: torch.Tensor creates an uninitialized tensor, so you have to make sure to properly initialize it afterwards.
For batchnorm layers, it is done in the reset_parameters() method.
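Roughly, for the 1.1.0-era version discussed here, reset_parameters() boils down to something like this sketch (not the verbatim library source):

    import torch.nn as nn
    from torch.nn import init

    def reset_bn_parameters(bn: nn.BatchNorm2d) -> None:
        # Rough sketch of what reset_parameters() does in the 1.1.0-era code:
        bn.running_mean.zero_()       # running statistics start at mean 0 ...
        bn.running_var.fill_(1)       # ... and variance 1
        if bn.affine:
            init.uniform_(bn.weight)  # gamma ~ U[0, 1] (older behavior)
            init.zeros_(bn.bias)      # beta = 0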

Thank you very much @ptrblck, that makes a lot of sense!

But then it does not match w ~ U[0, 1], does it? Looking into that function and init.ones_, the method fills the weights with ones, not with values drawn i.i.d. from U[0, 1]. So the weight is initialized with ones, not with i.i.d. U[0, 1].

Yeah, you are right; I posted the implementation of the current master branch.
This PR was only merged recently, so the new behavior is only documented in the master docs.
Before that, the weights were initialized using a uniform distribution (line of code for 1.1.0).
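
A quick way to check which behavior your own installation uses is to inspect a freshly constructed layer, e.g.:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(4)
    print(torch.__version__)
    print(bn.weight)  # all ones on recent versions, values in [0, 1) on 1.1.0 and older
    print(bn.bias)    # zeros in both cases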

Sorry for the confusion.

Yay! Thank you very much for the complete explanation and pointing to the doc, code and reference. It’s been extremely helpful.