It seems that the gradient in bn17 gets the first NaN values, while the last layer (fc1) seems to have valid gradients.
You could check how large the gradients in fc1 are and why they might be overflowing in the bn17 layer.
It also seems that a lower learning rate delays the first NaN result, which might also point to very large magnitudes in some values.
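To compare the gradient magnitudes per layer, something like the following could be run right after the backward pass. This is only a sketch: TinyNet is a made-up stand-in for your model (only the layer names bn17 and fc1 are taken from your post), and grad_report is a hypothetical helper, not a PyTorch function.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in model; only the names bn17 and fc1 come from the thread."""

    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 16)
        self.bn17 = nn.BatchNorm1d(16)
        self.fc1 = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc1(torch.relu(self.bn17(self.lin(x))))


def grad_report(model):
    """Return {param_name: (max_abs_grad, all_finite)} for parameters that have a gradient."""
    report = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        grad = param.grad.detach()
        report[name] = (grad.abs().max().item(), bool(torch.isfinite(grad).all()))
    return report


torch.manual_seed(0)
model = TinyNet()
x = torch.randn(32, 8)
target = torch.randint(0, 4, (32,))

loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

# Print the largest absolute gradient per parameter and whether all entries are finite
report = grad_report(model)
for name, (max_abs, finite) in report.items():
    print(f"{name:15s} max|grad|={max_abs:.3e} finite={finite}")
```

Calling grad_report in each iteration of your real training loop should show whether the fc1 gradients grow steadily before bn17 produces the first NaN. torch.autograd.set_detect_anomaly(True) could additionally be enabled to get an error pointing at the backward operation that created the NaN, at the cost of slower training.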
Could you also check the stdv in your normalize method?
Since you are not adding an eps to the denominator, the output might contain huge values if stdv is small or close to zero.
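As a small illustration of the eps issue: I don't know what your normalize method looks like exactly, so the signature and the default eps=1e-8 below are assumptions.

```python
import torch


def normalize(x, eps=1e-8):
    """Hypothetical normalize: standardize x, with eps guarding against a zero stdv."""
    mean = x.mean()
    stdv = x.std()
    return (x - mean) / (stdv + eps)


# A constant input has stdv == 0, so the unguarded division computes 0 / 0
const = torch.full((4,), 3.0)
no_eps = (const - const.mean()) / const.std()
with_eps = normalize(const)

print(no_eps)    # NaN in every entry
print(with_eps)  # zeros in every entry
```

With a tiny but nonzero stdv the unguarded version would not return NaN directly, but the outputs could become very large and might overflow later in the forward or backward pass.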