I know that the default initialization of Conv/Linear layers is kaiming_uniform_ with a = sqrt(5), i.e. the weights are drawn from U(-bound, bound) with bound = sqrt(6 / ((1 + a^2) * fan_in)), where a is the negative slope used to compute the gain of the nonlinearity.
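For reference, this is roughly what reset_parameters() does for a Linear layer (a paraphrased sketch, not the exact library source, and it may differ between PyTorch versions):

```python
import math
import torch.nn as nn

def default_linear_init(layer: nn.Linear) -> None:
    # Sketch of the default nn.Linear.reset_parameters() behaviour.
    # kaiming_uniform_ with a = sqrt(5) and the leaky_relu gain sqrt(2 / (1 + a^2))
    # gives bound = sqrt(6 / (6 * fan_in)) = 1 / sqrt(fan_in).
    nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))
    if layer.bias is not None:
        fan_in = layer.weight.size(1)  # in_features for a Linear layer
        bound = 1 / math.sqrt(fan_in)
        nn.init.uniform_(layer.bias, -bound, bound)
```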
In my VGG, all the nonlinearities are ReLU, so according to the Kaiming initialization paper I should set a=0 (see the sketch below for what I mean). When I use that initialization, the loss blows up to NaN.
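Concretely, by "set a=0" I mean re-initializing every conv/linear weight with something like the following (a minimal sketch; the exact mode and bias handling are just for illustration):

```python
import torch.nn as nn

def kaiming_relu_init(model: nn.Module) -> None:
    # Kaiming (He) initialization for ReLU networks: a = 0, i.e. gain = sqrt(2).
    # a=0 with nonlinearity='leaky_relu' is equivalent to nonlinearity='relu'.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_in',
                                     nonlinearity='leaky_relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```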
But when I use the default initialization, my network trains successfully.
What is going wrong here? And why does PyTorch use sqrt(5) as the default value of a?