I know that the default initialization of Conv/Linear layers is kaiming_uniform_ with a = sqrt(5), i.e. the weights are drawn from U(-bound, bound) with bound = sqrt(6 / ((1 + a^2) * fan_in)), where a is the negative slope used to compute the gain of the nonlinearity.
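For reference, this is roughly what reset_parameters() does for a Linear layer (a paraphrased sketch, not the exact library source, and it may differ between PyTorch versions):

```python
import math
import torch.nn as nn

def default_linear_init(layer: nn.Linear) -> None:
    # Sketch of the default nn.Linear.reset_parameters() behaviour.
    # kaiming_uniform_ with a = sqrt(5) and the leaky_relu gain sqrt(2 / (1 + a^2))
    # gives bound = sqrt(6 / (6 * fan_in)) = 1 / sqrt(fan_in).
    nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))
    if layer.bias is not None:
        fan_in = layer.weight.size(1)  # in_features for a Linear layer
        bound = 1 / math.sqrt(fan_in)
        nn.init.uniform_(layer.bias, -bound, bound)
```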
In my VGG, all the nonlinearities are ReLU, so according to the Kaiming initialization paper I should set a=0 (see the sketch below for what I mean). When I use that initialization, the loss blows up to NaN.
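Concretely, by "set a=0" I mean re-initializing every conv/linear weight with something like the following (a minimal sketch; the exact mode and bias handling are just for illustration):

```python
import torch.nn as nn

def kaiming_relu_init(model: nn.Module) -> None:
    # Kaiming (He) initialization for ReLU networks: a = 0, i.e. gain = sqrt(2).
    # a=0 with nonlinearity='leaky_relu' is equivalent to nonlinearity='relu'.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_in',
                                     nonlinearity='leaky_relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```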
But when I use the default initialization, my network trains successfully.
What is going wrong here? And why does PyTorch use sqrt(5) as the default value of a?