BatchNorm Initialization

Recently I rebuilt my Caffe model in PyTorch and got much worse performance than the original, and the convergence speed is also slightly slower than before. When I checked the initialization of the model, I noticed that in Caffe's BN (actually the scale layer) the parameter gamma is initialized to 1.0, while the default initialization in PyTorch seems to produce random float numbers. In addition, when I initialize the BN weight to 1.0, the training loss drops faster (see the sketch below for what I do). So I would like to know what the default initialization is in PyTorch's BN layer. Could anyone please answer my question? Thanks a lot!
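
To mimic the Caffe behavior, I do something along these lines (a minimal sketch; init_bn_like_caffe is just an illustrative helper name, not code from my project):

    import torch.nn as nn

    def init_bn_like_caffe(model: nn.Module) -> None:
        # Set gamma (weight) to 1.0 and beta (bias) to 0.0 for every BN layer,
        # matching Caffe's scale-layer defaults.
        for m in model.modules():
            if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                if m.affine:
                    nn.init.constant_(m.weight, 1.0)
                    nn.init.constant_(m.bias, 0.0)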

Regards
MeowLady

It’s this: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py#L33-L39

w ~ U[0, 1]
b = 0

Wow, got it! Thanks a lot for your quick and accurate reply!

I might be misunderstanding something, but I also want to see how the weights and biases of BatchNorm are initialized. Looking at the source code, I see the following in the __init__ for the default parameters:

            self.weight = Parameter(torch.Tensor(num_features))
            self.bias = Parameter(torch.Tensor(num_features))

However, when I run torch.Tensor(num_features) on my computer I get some numbers < 0 and > 1, and even some nan values. Am I doing something wrong or missing something?

Thanks!

You are right: torch.Tensor creates an uninitialized tensor, so you have to make sure to properly initialize it afterwards.
For batchnorm layers, it is done in the reset_parameters() method.
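Roughly, for the 1.1.0-era version discussed here, reset_parameters() boils down to something like this sketch (not the verbatim library source):

    import torch.nn as nn
    from torch.nn import init

    def reset_bn_parameters(bn: nn.BatchNorm2d) -> None:
        # Rough sketch of what reset_parameters() does in the 1.1.0-era code:
        bn.running_mean.zero_()       # running statistics start at mean 0 ...
        bn.running_var.fill_(1)       # ... and variance 1
        if bn.affine:
            init.uniform_(bn.weight)  # gamma ~ U[0, 1] (older behavior)
            init.zeros_(bn.bias)      # beta = 0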

Thank you very much @ptrblck, that makes a lot of sense!

But then it does not match w ~ U[0, 1], does it? Looking into that function and init.ones_, the method fills the weights with ones, not with values drawn i.i.d. from U[0, 1]. So the weight is initialized with ones, not with i.i.d. U[0, 1].

Yeah, you are right; I posted the implementation of the current master branch.
This PR was only merged recently, so the new behavior is only documented in the master docs.
Before that, the weights were initialized using a uniform distribution (line of code for 1.1.0).
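
A quick way to check which behavior your own installation uses is to inspect a freshly constructed layer, e.g.:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(4)
    print(torch.__version__)
    print(bn.weight)  # all ones on recent versions, values in [0, 1) on 1.1.0 and older
    print(bn.bias)    # zeros in both cases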

Sorry for the confusion.

Yay! Thank you very much for the complete explanation and pointing to the doc, code and reference. It’s been extremely helpful.