Loss is not NaN, but the gradients are

Hi all,
I’ve found that in my neural network, I’m getting non-NaN losses with NaN gradients. I’m using Adam with default parameters, and there’s nothing fancy in my network. Has anyone come across such an issue? I’ve always thought that NaN losses cause NaN gradients, but this is a bit odd.



Many operations can produce NaN in the backward pass even when all forward values are finite. For example, sqrt at 0.
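A minimal reproduction of this, as a sketch: sqrt(0) is fine in the forward pass, but its derivative 1/(2·sqrt(x)) is infinite at 0, and a zero upstream gradient turns that into 0 × inf = NaN.

```python
import torch

# sqrt(0) is finite in the forward pass, but its derivative 1/(2*sqrt(x))
# is infinite at x=0. A zero upstream gradient then gives 0 * inf = NaN.
x = torch.zeros(1, requires_grad=True)
y = torch.sqrt(x) * 0.0   # forward value is 0 -- perfectly finite
y.sum().backward()
print(y)       # not NaN
print(x.grad)  # tensor([nan])
```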

You will need to find where the NaNs first appear in the backward pass to be sure.
If you’re using master, you can use anomaly detection to get that information.
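In current PyTorch releases, anomaly detection is available as `torch.autograd.detect_anomaly()`. It checks every backward function’s output for NaN and raises a RuntimeError naming the offending function, along with a traceback pointing at the forward call that created it. Reusing the sqrt-at-0 case as a hypothetical example:

```python
import torch

x = torch.zeros(1, requires_grad=True)

# With anomaly detection enabled, the backward pass is checked for NaN
# and a RuntimeError names the offending function (here SqrtBackward).
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x) * 0.0
    try:
        y.sum().backward()
    except RuntimeError as e:
        print(e)  # names the backward function that produced NaN
```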


I don’t seem to have it. What is master, anyway?
Also, while debugging, I’ve noticed gradients of the order of 1e-19. Could that be a problem?

What I call master is the version of PyTorch built from the current master branch on GitHub, not a release.
It depends on what operations you are doing; 0/0 will give you NaN, for example.

I’ve almost always had this problem as the result of taking the square root of 0.

Note that .std() implicitly takes the square root.


whooo, anomaly detection looks interesting 🙂


Is there any known issue with softmax and small values?

If my PyTorch version is 0.3, is there any way to check what causes the NaN gradients? Or could you please give some examples?

To check the forward pass, you will need to add prints.
To check the backward pass, you will need to add hooks and prints.
Unfortunately, in older versions there is no easier way than doing it by hand.

Hi~ Is it possible that the loss is NaN while the gradients are not?

No, that should not be possible, since the NaN loss value would be backpropagated and would create invalid gradients throughout the model. At least I wouldn’t know of an operation that could “recover” the gradient, or how it would work.
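A minimal check of this, as a sketch: once the loss itself is NaN, the chain rule multiplies every upstream gradient by NaN, so all parameter gradients come out NaN as well.

```python
import torch

# A NaN loss poisons every gradient via the chain rule:
# each upstream gradient is multiplied by NaN during backward.
x = torch.ones(3, requires_grad=True)
loss = x.sum() * float("nan")   # a NaN loss, for demonstration
loss.backward()
print(x.grad)  # all NaN
```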

Get it. Thanks for your patience!