Loss is not NaN, but the gradients are

Hi all,
I’ve found that in my neural network, I’m getting non-NaN losses with NaN gradients. I’m using Adam with default parameters, and there’s nothing fancy in my network. Has anyone come across such an issue? I’ve always thought that NaN losses cause NaN gradients, but this is a bit odd.



Many operations can produce NaN in the backward pass even when all forward values are finite. For example, sqrt at 0.
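A minimal reproduction of this, as a sketch: sqrt(0) is fine in the forward pass, but its derivative 1/(2·sqrt(x)) is infinite at 0, and a zero upstream gradient turns that into 0 × inf = NaN.

```python
import torch

# sqrt(0) is finite in the forward pass, but its derivative 1/(2*sqrt(x))
# is infinite at x=0. A zero upstream gradient then gives 0 * inf = NaN.
x = torch.zeros(1, requires_grad=True)
y = torch.sqrt(x) * 0.0   # forward value is 0 -- perfectly finite
y.sum().backward()
print(y)       # not NaN
print(x.grad)  # tensor([nan])
```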

You will need to find where the NaNs first appear in the backward pass to be sure.
If you’re using master, you can use anomaly detection to get that information.
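In current PyTorch releases, anomaly detection is available as `torch.autograd.detect_anomaly()`. It checks every backward function’s output for NaN and raises a RuntimeError naming the offending function, along with a traceback pointing at the forward call that created it. Reusing the sqrt-at-0 case as a hypothetical example:

```python
import torch

x = torch.zeros(1, requires_grad=True)

# With anomaly detection enabled, the backward pass is checked for NaN
# and a RuntimeError names the offending function (here SqrtBackward).
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x) * 0.0
    try:
        y.sum().backward()
    except RuntimeError as e:
        print(e)  # names the backward function that produced NaN
```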


I don’t seem to have it. What is master, anyway?
Also, while debugging, I’ve noticed gradients of the order of 1e-19. Could that be a problem?

What I call master is the version of PyTorch built from the current master branch on GitHub, not a release.
It depends on what operations you are doing; 0/0 will give you NaN, for example.

I’ve almost always had this problem as the result of taking the square root of 0.

Note that .std() implicitly takes the square root.


whooo, anomaly detection looks interesting 🙂


Is there any known issue with softmax and small values?

If my PyTorch version is 0.3, is there any way to check what causes the NaN gradients? Or could you please give some examples?

To check the forward pass, you will need to add prints.
To check the backward pass, you will need to add hooks and prints.
Unfortunately, in older versions there is no easier way than doing it by hand.

Hi~ Is it possible that the loss is NaN while the gradients are not?

No, that should not be possible, since the NaN loss value would be backpropagated and would create invalid gradients throughout the model. At least I wouldn’t know of an operation that could “recover” the gradient, or how it would work.
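A minimal check of this, as a sketch: once the loss itself is NaN, the chain rule multiplies every upstream gradient by NaN, so all parameter gradients come out NaN as well.

```python
import torch

# A NaN loss poisons every gradient via the chain rule:
# each upstream gradient is multiplied by NaN during backward.
x = torch.ones(3, requires_grad=True)
loss = x.sum() * float("nan")   # a NaN loss, for demonstration
loss.backward()
print(x.grad)  # all NaN
```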

Get it. Thanks for your patience!