How to debug a NaN in the loss?

lev · June 19, 2020, 3:06pm

Hi all,
I am not posting the actual code since there is a lot of it and it cannot easily be reduced to the problem. But essentially I have

output = net(input) which is a batchsize x 1 tensor.
I calculate the mean of output over the batchsize and my loss function is

loss = (mean - 10).pow(2).

So I am trying to get the mean of my network output to equal 10. For the first 2 iterations the loss goes down; in the third iteration it suddenly goes up to a million and then becomes NaN.

How can I debug such an issue (the definition of the loss was a simplification)?
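
One way to start debugging this kind of problem is to log the loss and the gradient norm at every step and stop at the first non-finite value. The following is only a minimal sketch, assuming the framework is PyTorch (the `.pow(2)` call suggests it, but the thread does not say). The network `net`, the data `inputs`, the batch size, and the learning rate are hypothetical stand-ins, not the poster's actual setup.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the poster's network and data.
net = torch.nn.Linear(4, 1)
inputs = torch.randn(32, 4)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

log = []  # (step, loss, gradient norm), kept for later inspection
for step in range(50):
    optimizer.zero_grad()
    output = net(inputs)                    # batchsize x 1 tensor
    loss = (output.mean() - 10).pow(2)      # simplified loss from the post

    # Stop at the first inf/NaN so the state just before it can be inspected.
    if not torch.isfinite(loss):
        print(f"non-finite loss at step {step}")
        break

    loss.backward()

    # Total L2 norm over all parameter gradients.
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in net.parameters()])
    )
    log.append((step, loss.item(), grad_norm.item()))
    optimizer.step()
```

If the gradient norm in `log` jumps by orders of magnitude a step or two before the loss does, that points to an exploding gradient rather than a bug in the loss itself. PyTorch also provides `torch.autograd.set_detect_anomaly(True)`, which makes the backward pass raise an error when it produces NaN values, at the cost of slower training.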

Atul_Kumar · June 19, 2020, 7:33pm

I think the problem is with your loss function. You should formulate it differently and add more constraints, otherwise the gradient is going to explode. Or try abs(mean - 10).

lev · June 22, 2020, 6:21am

But why? What is the problem with it? (Btw, I did the same in TensorFlow, at least I hope it is the same, and got no error.)

lfolle · June 22, 2020, 7:58am

The problem might be that your loss has no upper bound. If your mean takes a large value, the resulting loss will be extremely large, possibly resulting in a NaN.
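
To illustrate how an unbounded loss can end in NaN, here is a small sketch in plain Python rather than in a deep learning framework. It runs gradient descent on the scalar loss `(mean - 10)^2` with a deliberately oversized, made-up learning rate. The values first overflow to `inf`, and the next update computes `inf - inf`, which is NaN. Plain Python uses 64-bit floats; a network running in 32-bit floats would overflow much sooner.

```python
import math

def run_until_nan(lr=50.0, target=10.0, max_steps=1000):
    """Gradient descent on (mean - target)^2 with a far too large step size."""
    mean = 0.0
    history = []
    for _ in range(max_steps):
        diff = mean - target
        loss = diff * diff           # overflows silently to inf, no exception
        history.append(loss)
        if math.isnan(loss):
            break
        grad = 2.0 * diff
        mean = mean - lr * grad      # becomes inf, then inf - inf = nan
    return history

history = run_until_nan()
print(history[:3], "...", history[-2:])
```

The loss grows by a constant factor per step, reaches `inf`, and the very next value is NaN, which is the same pattern as a loss that jumps to huge numbers right before turning into NaN.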

You might be able to alleviate this by limiting the loss, either by clipping the loss to a maximum value or by limiting the range of the mean values with a sigmoid before the loss calculation.

lev · June 22, 2020, 8:48am

You are right, the loss is suddenly growing. But it starts properly:
The sequence is: 1600, 1550, 1520, 600000, 5e17, nan
I am wondering why it suddenly goes up like that.
If I clip the loss, won't my gradient be 0 and training collapse?
Sigmoid seems like an interesting idea, I will try that, thanks.
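
The concern about clipping the loss can be checked numerically. This is a small sketch in plain Python, using a finite-difference gradient instead of autograd. The threshold `max_loss=100.0` and the helper names are made up for the illustration.

```python
def loss(mean, target=10.0):
    return (mean - target) ** 2

def clipped_loss(mean, max_loss=100.0):
    # Clip the loss value itself to an upper bound.
    return min(loss(mean), max_loss)

def numerical_grad(f, x, eps=1e-6):
    # Central finite difference as a stand-in for autograd.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# mean = 50 gives an unclipped loss of 1600, well above the threshold.
print(numerical_grad(loss, 50.0))          # close to 2 * (50 - 10) = 80
print(numerical_grad(clipped_loss, 50.0))  # 0.0, the clipped region is flat
print(numerical_grad(clipped_loss, 12.0))  # close to 4, below the threshold
```

Wherever the clip is active the loss is constant, so the gradient there is exactly zero and the parameters receive no update. Below the threshold the clipped loss behaves like the original one.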

lfolle · June 22, 2020, 9:17am

Yes, you are right, I meant clipping the gradient. This should avoid extreme steps during the optimization.
You might also try lowering your learning rate to see whether the current learning rate is too high.
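
Both suggestions can be compared on a toy version of the problem. The sketch below is plain Python, not the poster's setup: it treats the mean as a single trainable scalar and runs gradient descent on `(mean - 10)^2`. The learning rates and the clip value `max_grad=1.0` are arbitrary choices for the illustration.

```python
def train(lr, steps=20, target=10.0, max_grad=None):
    """Gradient descent on loss = (mean - target)^2 for a scalar `mean`."""
    mean = 0.0
    losses = []
    for _ in range(steps):
        diff = mean - target
        losses.append(diff * diff)
        grad = 2.0 * diff
        if max_grad is not None:
            # Clip the gradient's magnitude, keeping its sign.
            grad = max(-max_grad, min(max_grad, grad))
        mean -= lr * grad
    return losses

stable = train(lr=0.1)                   # error shrinks by 0.8 per step
diverging = train(lr=1.2)                # error grows by 1.4 per step
clipped = train(lr=1.2, max_grad=1.0)    # same lr, but steps are bounded

print(stable[-1], diverging[-1], clipped[-1])
```

In this toy case each update multiplies the error by `1 - 2 * lr`, so the loss explodes as soon as the learning rate is above 1. With clipping, the same learning rate no longer diverges, but the mean keeps oscillating around the target instead of settling on it, so the final value is worse than with a small learning rate. In a real network the equivalent threshold depends on the current weights, so it can be crossed in the middle of training. In PyTorch, gradient clipping is usually done with `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and `optimizer.step()`.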

lev · June 22, 2020, 9:49am

With a much reduced learning rate it works, but it is too slow and convergence is not as good as it should be (compared to the TensorFlow version).
OK, I can try clipping the gradient, thanks.
I have now tried the sigmoid, and the result was that the loss went down for many iterations (hundreds) and then again suddenly shot up within a very short time (10 iterations) and gave NaN again. I am more and more confused.

lev · June 22, 2020, 10:38am

Thank you once more for your advice. The clipping did the trick insofar as it does converge. The value is not very good though.
Still, I am puzzled why this gradient explodes so suddenly. But it’s good to be able to control it.