I read the paper on batch normalization, but I could not find how it initializes the weight. So I looked at the code in PyTorch, shown below:
nn/modules/batchnorm.py, lines 31-32:
self.weight.data.uniform_()
self.bias.data.zero_()
So why are the weights of batch normalization initialized like this? Is there any theory showing that this initialization is optimal?
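For reference, here is a minimal sketch of how one could reproduce or override this default in user code (the module size here is arbitrary, just for illustration):

import torch.nn as nn

bn = nn.BatchNorm2d(16)
# reproduce the default quoted above: weight (gamma) ~ U(0, 1), bias (beta) = 0
nn.init.uniform_(bn.weight)
nn.init.zeros_(bn.bias)
# a common alternative, and if I understand correctly the default in later
# PyTorch versions: initialize gamma to 1
nn.init.ones_(bn.weight)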