Is there a reason why Kaiming normal initialization substantially increases the training loss?
No, there shouldn’t be a specific reason for Kaiming/He initialization increasing the training loss; personally, I have only noticed minor differences. Actually, the Kaiming/He (normal) initialization scheme and PyTorch’s default initialization scheme look relatively similar.
Kaiming (normal) uses:

std = sqrt(2.0 / ((1 + a^2) * fan_in))

with a=0 by default, i.e. std = sqrt(2.0 / fan_in).
The PyTorch default uses (if I see that correctly from nn.Linear.reset_parameters):

stdv = 1. / math.sqrt(self.weight.size(1))  # i.e. 1 / sqrt(fan_in)
self.weight.data.uniform_(-stdv, stdv)
if self.bias is not None:
    self.bias.data.uniform_(-stdv, stdv)
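For a quick sanity check of the two scales, here is a minimal sketch (the layer shape of 100x10 is just an arbitrary example):

```python
import math
import torch.nn as nn

fan_in = 100                                  # arbitrary example size

lin_default = nn.Linear(fan_in, 10)           # keeps PyTorch's default init
lin_kaiming = nn.Linear(fan_in, 10)
nn.init.kaiming_normal_(lin_kaiming.weight)   # a=0, mode='fan_in' by default

print(lin_default.weight.std())   # ~1 / sqrt(3 * fan_in) ~= 0.058 (uniform)
print(lin_kaiming.weight.std())   # ~sqrt(2 / fan_in)     ~= 0.141 (normal)
print(math.sqrt(2.0 / fan_in))    # theoretical Kaiming std
```

So even though both scale with 1 / sqrt(fan_in), the Kaiming normal weights end up with a roughly 2.5x larger standard deviation.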
But since the default draws from a uniform distribution with bound 1 / sqrt(fan_in) (an effective std of about 1 / sqrt(3 * fan_in)), while Kaiming normal uses std = sqrt(2 / fan_in), the Kaiming-initialized weights start out noticeably larger. So you probably want to lower your learning rate when using
kaiming_normal_. Maybe choose the learning rate as follows:
learning_rate_before * default_std(fan_in) = new_learning_rate * kaiming_std(fan_in)
=> new_learning_rate = learning_rate_before * default_std(fan_in) / kaiming_std(fan_in)
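As a rough sketch of that rescaling (the base learning rate of 1e-3 and fan_in of 100 are placeholders; default_std and kaiming_std follow the values above):

```python
import math

fan_in = 100        # example fan-in, use your layer's actual value
lr_before = 1e-3    # placeholder for your current learning rate

default_std = 1.0 / math.sqrt(3 * fan_in)   # std of uniform(-1/sqrt(fan_in), 1/sqrt(fan_in))
kaiming_std = math.sqrt(2.0 / fan_in)       # std of kaiming_normal_ with a=0

new_lr = lr_before * default_std / kaiming_std
print(new_lr)   # ~4.1e-4, i.e. roughly 2.5x smaller
```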
Would be curious to hear what you find…