The weights of the convolution kernel become NaN after training for several batches

My network’s weights suddenly change to NaN during training. That is, it trains normally for a while and then the weights become NaN at a random batch (not always the same one). In debug mode, I found that all inputs are normal, and the `.item()` value of the last loss is normal as well. What could cause the network weights to become NaN?

Weights going to NaN are typically due to overflow. The most common cause I know for this issue is a learning rate that is too high combined with no gradient clipping. This causes the parameters of your network to diverge towards +/- infinity.

Have you tried lowering the learning rate? You could also log the norm of the gradients (see the thread "Check the norm of gradients").
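A minimal sketch of what logging the gradient norm could look like, assuming a small hypothetical model (the layer sizes and loss are placeholders, not from the original post):

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, just to illustrate the logging pattern.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()

x = torch.randn(4, 8)
y = torch.randn(4, 1)

loss = criterion(model(x), y)
loss.backward()

# Total L2 norm of all parameter gradients; log this every N batches.
total_norm = torch.norm(
    torch.stack([p.grad.detach().norm(2)
                 for p in model.parameters() if p.grad is not None])
)
print(f"total grad norm: {total_norm.item():.4f}")
```

If the logged norm spikes by orders of magnitude shortly before the NaN appears, that points at exploding gradients rather than bad input data.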

I applied nn.utils.clip_grad_norm_(model.parameters(), max_norm=2) in my training loop, but NaNs still appear. My initial learning rate is 0.0004 and I use an ExponentialLR(gamma=0.9) schedule. I’m wondering whether reducing the loss with mean versus sum, or the weight of each sub-loss (loss = w1*loss1 + w2*loss2 + .. + wn*lossn), has a significant effect on the gradients. And is there a way to ensure that NaN gradients never appear in the network?
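For reference, one way the described setup (clipping to max_norm=2, lr=0.0004, ExponentialLR with gamma=0.9) could be wired together, with an added guard that skips the optimizer step when any gradient is non-finite. The model and data are hypothetical placeholders; the guard is a defensive pattern, not something from the original post:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; substitute your own network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
criterion = nn.MSELoss()

x = torch.randn(4, 8)
y = torch.randn(4, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Note: the parameters must be passed explicitly to clip_grad_norm_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)

# Guard: skip the update if any gradient is NaN or inf.
grads_finite = all(torch.isfinite(p.grad).all()
                   for p in model.parameters() if p.grad is not None)
if grads_finite:
    optimizer.step()
scheduler.step()
```

Note that clipping happens after backward() but before step(), and that clipping bounds the gradient norm only: if the loss itself already contains a NaN (e.g. from a log of zero or a division by zero in one of the sub-losses), clipping cannot repair it, which is why the finiteness guard checks the gradients directly.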

There is a similar thread at Gradient value is nan - #3 by saumya0303

As @ptrblck suggested, you could use torch.autograd.set_detect_anomaly(True) to see where the gradients become NaN and debug from there.
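A small self-contained example of anomaly detection in action; the sqrt of a negative number is a deliberately contrived trigger so the snippet fails predictably:

```python
import torch

# Makes autograd raise a RuntimeError naming the backward function
# that produced NaN, instead of silently propagating it.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # sqrt of a negative number produces NaN

caught = False
try:
    y.sum().backward()
except RuntimeError as e:
    caught = True
    print("Anomaly detected:", e)
```

The error message names the offending backward function (here SqrtBackward), which tells you which operation in your graph to inspect. Keep in mind that anomaly detection adds significant overhead, so enable it only while debugging, not for regular training runs.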

Hope this helps.