The weights of the convolution kernel become NaN after training for several batches

My network’s weights suddenly change to NaN during training. That is, it trains normally for a while and then the weights become NaN at a random batch (not always the same one). In debug mode, I found that all inputs are normal, and the `.item()` value of the last loss is normal as well. What could cause the network weights to become NaN?

Weights going to NaN are typically due to overflow. The most common cause I know for this issue is a learning rate that is too high combined with no gradient clipping. This causes the parameters of your network to diverge towards +/- infinity.

Have you tried lowering the learning rate? You could also log the norm of the gradients (see the thread "Check the norm of gradients").
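A minimal sketch of what logging the gradient norm could look like, assuming a small hypothetical model (the layer sizes and loss are placeholders, not from the original post):

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, just to illustrate the logging pattern.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()

x = torch.randn(4, 8)
y = torch.randn(4, 1)

loss = criterion(model(x), y)
loss.backward()

# Total L2 norm of all parameter gradients; log this every N batches.
total_norm = torch.norm(
    torch.stack([p.grad.detach().norm(2)
                 for p in model.parameters() if p.grad is not None])
)
print(f"total grad norm: {total_norm.item():.4f}")
```

If the logged norm spikes by orders of magnitude shortly before the NaN appears, that points at exploding gradients rather than bad input data.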

I applied nn.utils.clip_grad_norm_(model.parameters(), max_norm=2) in my training loop, but NaNs still appear. My initial learning rate is 0.0004 and I use an ExponentialLR(gamma=0.9) schedule. I’m wondering whether reducing the loss with mean versus sum, or the weight of each sub-loss (loss = w1*loss1 + w2*loss2 + .. + wn*lossn), has a significant effect on the gradients. And is there a way to ensure that NaN gradients never appear in the network?
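For reference, one way the described setup (clipping to max_norm=2, lr=0.0004, ExponentialLR with gamma=0.9) could be wired together, with an added guard that skips the optimizer step when any gradient is non-finite. The model and data are hypothetical placeholders; the guard is a defensive pattern, not something from the original post:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; substitute your own network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
criterion = nn.MSELoss()

x = torch.randn(4, 8)
y = torch.randn(4, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Note: the parameters must be passed explicitly to clip_grad_norm_.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)

# Guard: skip the update if any gradient is NaN or inf.
grads_finite = all(torch.isfinite(p.grad).all()
                   for p in model.parameters() if p.grad is not None)
if grads_finite:
    optimizer.step()
scheduler.step()
```

Note that clipping happens after backward() but before step(), and that clipping bounds the gradient norm only: if the loss itself already contains a NaN (e.g. from a log of zero or a division by zero in one of the sub-losses), clipping cannot repair it, which is why the finiteness guard checks the gradients directly.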

There is a similar thread at Gradient value is nan - #3 by saumya0303

As @ptrblck suggested, you could use torch.autograd.set_detect_anomaly(True) to see where the gradients become NaN and debug from there.
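A small self-contained example of anomaly detection in action; the sqrt of a negative number is a deliberately contrived trigger so the snippet fails predictably:

```python
import torch

# Makes autograd raise a RuntimeError naming the backward function
# that produced NaN, instead of silently propagating it.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # sqrt of a negative number produces NaN

caught = False
try:
    y.sum().backward()
except RuntimeError as e:
    caught = True
    print("Anomaly detected:", e)
```

The error message names the offending backward function (here SqrtBackward), which tells you which operation in your graph to inspect. Keep in mind that anomaly detection adds significant overhead, so enable it only while debugging, not for regular training runs.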

Hope this helps.