Weights become NaN values after first batch step

ptrblck · July 9, 2020, 10:54pm

It seems that the first NaN values show up in the gradients of bn17, while the last layer (fc1) still has valid gradients.
You could check how large the gradients in fc1 are and why they might be overflowing by the time they reach the bn17 layer.
A lower learning rate also seems to delay the first NaN result, which again points to a high magnitude in some values.
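
A minimal sketch of how you could inspect the gradient magnitudes per parameter after backward(). The model here is only a hypothetical stand-in (I don't have your code), and grad_report is a helper name I made up; the layer names bn17 and fc1 just mirror the ones from your output.

```python
import torch
import torch.nn as nn


def grad_report(model: nn.Module) -> dict:
    """Map each parameter name to (max abs gradient, all values finite?)."""
    report = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            # Parameter did not take part in the backward pass.
            continue
        grad = param.grad.detach()
        report[name] = (grad.abs().max().item(), bool(torch.isfinite(grad).all()))
    return report


# Hypothetical stand-in model, NOT the architecture from this thread.
model = nn.Sequential()
model.add_module("bn17", nn.BatchNorm1d(8))
model.add_module("fc1", nn.Linear(8, 2))

torch.manual_seed(0)
x = torch.randn(16, 8)
loss = model(x).pow(2).mean()
loss.backward()

# Print the report before optimizer.step() so you see the raw gradients.
for name, (max_abs, finite) in grad_report(model).items():
    print(f"{name:12s} max|grad|={max_abs:.3e} finite={finite}")
```

Calling this in each iteration right after loss.backward() should show which layer is the first to report a huge or non-finite gradient. torch.autograd.set_detect_anomaly(True) might also help to locate the operation that produces the first NaN in the backward pass, at the cost of slower training.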

Could you also check the stdv in your normalize method?
Since you are not adding an eps to the denominator, the output might contain huge values (or Inf/NaN) if stdv is small or close to zero.
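
Something like this could work as a starting point. Note that I'm guessing at what your normalize method does (per-feature standardization over the batch dimension), and the eps value of 1e-8 is an assumption you might need to tune.

```python
import torch


def normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize each feature column; eps keeps the denominator away from zero."""
    mean = x.mean(dim=0, keepdim=True)
    stdv = x.std(dim=0, keepdim=True)
    return (x - mean) / (stdv + eps)


# The second column is constant, so its stdv is exactly zero.
x = torch.tensor([[1.0, 5.0],
                  [2.0, 5.0],
                  [3.0, 5.0]])

out = normalize(x)
print(torch.isfinite(out).all())

# Without eps the constant column becomes 0 / 0, i.e. NaN.
unsafe = (x - x.mean(dim=0, keepdim=True)) / x.std(dim=0, keepdim=True)
print(torch.isnan(unsafe).any())
```

If stdv is tiny but not exactly zero, the eps also caps how much the division can amplify the input, which should keep the activations (and thus the gradients) in a saner range.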