Hi all,
I am not posting the actual code since there is a lot of it and it cannot easily be reduced to the problem. But essentially I have

output = net(input), which is a batchsize x 1 tensor.
I calculate the mean of output over the batch, and my loss function is

loss = (mean - 10).pow(2).
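The setup described above can be sketched like this (the network and input shapes are illustrative, not the actual code):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the actual network (names/sizes are assumptions)
net = nn.Linear(4, 1)

input = torch.randn(8, 4)      # batchsize x features
output = net(input)            # batchsize x 1 tensor
mean = output.mean()           # scalar: mean over the batch
loss = (mean - 10).pow(2)      # quadratic penalty pushing the batch mean toward 10
```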

So I am trying to make the mean of my network output equal 10. For the first 2 iterations the loss goes down; in the third iteration it suddenly jumps to a million and then NaN.

How can I debug such an issue (the definition of the loss was a simplification)?

I think the problem is with your loss function. You should formulate it differently and add more constraints, otherwise the gradient is going to explode. Or try abs(mean - 10).

The problem might be that your loss has no upper bound. If your mean takes a large value, the resulting loss will be extremely large, possibly resulting in a NaN.

You might be able to alleviate this by limiting the loss.
Either clip the loss to a maximum value, or limit the range of the mean values by passing them through a sigmoid before they enter the loss calculation.
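The sigmoid idea could look like this (a sketch; the scale factor 20 is an assumption chosen so the target of 10 stays reachable):

```python
import torch

mean = torch.tensor(1e6)              # an extreme batch mean, for illustration
bounded = torch.sigmoid(mean) * 20.0  # squashed into (0, 20), so 10 is reachable
loss = (bounded - 10).pow(2)          # loss is now bounded above by 100
```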

You are right, the loss is suddenly growing. But it starts properly:
The sequence is: 1600, 1550, 1520, 600000, 5e17, nan
I am wondering why it suddenly goes up like that.
If I clip the loss, won't my gradient be 0 and training collapse?
Sigmoid seems an interesting idea, I will try that, thanks.

Yes, you are right; I meant clipping the gradient. This should avoid extreme steps during the optimization.
You might also try lowering your learning rate to see if the current learning rate is too high.
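Gradient clipping can be done between backward() and the optimizer step; a minimal sketch (the network, optimizer, and max_norm value are assumptions, not the poster's actual settings):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 1)                                  # illustrative network
opt = torch.optim.SGD(net.parameters(), lr=1e-3)

output = net(torch.randn(8, 4))
loss = (output.mean() - 10).pow(2)

opt.zero_grad()
loss.backward()
# Cap the global gradient norm before the optimizer updates the weights
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
opt.step()
```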

With a much reduced learning rate it works out, but it is too slow and convergence is not as good as it should be (compared to the TensorFlow version).
Ok I can try clipping the gradient, thanks.
I have now tried the sigmoid, and the result was that the loss went down for many iterations (hundreds) and then again suddenly shot up within a very short time (10 iterations) and gave NaN. I am more and more confused.

Thank you once more for your advice. The clipping did the trick insofar as it now converges, though the final value is not very good.
I am still puzzled why this gradient explodes so suddenly, but it's good to be able to control it.