I am trying to initialize the layers of my network using the
kaiming_normal_ initializer. This method has an argument called mode, which can be either 'fan_in' or 'fan_out'.
I read the docs and found out it depends on the number of filters or channels. For example, I have a
Conv2d layer whose weight has shape
[64, 3, 4, 4]. When I calculate the values, I get
fan_in = 3 * 4 * 4 = 48 but fan_out = 64 * 4 * 4 = 1024, which is a huge difference and results in very different standard deviations, since
std = gain / math.sqrt(fan) is the
std that is passed to the normal distribution.
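To make the numbers concrete, here is how fan_in, fan_out, and the resulting std can be computed by hand for that weight shape. This is a sketch in plain Python mirroring what kaiming_normal_ does internally; gain = sqrt(2) assumes the default 'leaky_relu'/ReLU-style nonlinearity.

```python
import math

# Conv2d weight shape: [out_channels, in_channels, kH, kW]
out_channels, in_channels, kh, kw = 64, 3, 4, 4
receptive_field = kh * kw  # 16

fan_in = in_channels * receptive_field    # 3 * 16 = 48
fan_out = out_channels * receptive_field  # 64 * 16 = 1024

gain = math.sqrt(2.0)  # gain used for ReLU-family nonlinearities
std_fan_in = gain / math.sqrt(fan_in)     # ~0.204
std_fan_out = gain / math.sqrt(fan_out)   # ~0.044

print(fan_in, fan_out)
print(round(std_fan_in, 3), round(std_fan_out, 3))
```

So the two modes differ by a factor of sqrt(fan_out / fan_in) ≈ 4.6 in std for this layer.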
The question is that I could not find any specific explanation about which
mode should be chosen, either in the paper I am trying to implement or in other resources.
As far as I know, kaiming_normal_ (a.k.a. he_normal) is generally used with fan_in.
There are two parts:
- As Avinash points out, the default mode
'fan_in' is probably a good choice.
- For some intuition of why this is: each output is a weighted sum of
fan_in inputs. For a linear layer, this is one row of the matrix multiplication; for convolutions, it is the number of in-channels times the kernel size.
When @ptrblck and I implemented StyleGAN for PyTorch (though I don’t think the StyleGAN authors necessarily invented it), I came across the idea of not applying the multiplier during initialization, but instead applying it to the weight every time it is used. This has the effect of applying the scaling to both the initial values and the gradient updates. (Even if it is not that efficient without a hand-written convolution kernel.)
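A minimal sketch of that runtime-scaling idea (the "equalized learning rate" trick from StyleGAN), written here as a plain-Python linear layer with no autograd, just to show where the multiplier goes. The class name ScaledLinear and its interface are my own illustration, not a library API.

```python
import math
import random

class ScaledLinear:
    """Stores weights drawn from N(0, 1) and applies the Kaiming
    multiplier gain / sqrt(fan_in) at every forward call, instead of
    baking it into the initial weight values."""

    def __init__(self, fan_in, fan_out, gain=math.sqrt(2.0)):
        self.scale = gain / math.sqrt(fan_in)
        # unit-variance init; the effective weight is weight * scale
        self.weight = [[random.gauss(0.0, 1.0) for _ in range(fan_in)]
                       for _ in range(fan_out)]

    def forward(self, x):
        # y_j = sum_i (w_ji * scale) * x_i
        return [sum(w * self.scale * xi for w, xi in zip(row, x))
                for row in self.weight]

random.seed(0)
layer = ScaledLinear(fan_in=48, fan_out=4)
y = layer.forward([1.0] * 48)
print(len(y), round(layer.scale, 3))
```

Because the stored weights stay at unit scale, any gradient step is effectively rescaled by the same multiplier when the weight is next used, which is the point of doing it at runtime rather than at init.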
Why does resnet in torchvision apply kaiming_normal_ with mode ‘fan_out’? Is there a specific reason to do so?