NaN loss appearing after some time

Vijay_Dubey · December 26, 2017, 5:23pm

The loss function is a combination of mean squared error loss and cross-entropy loss.
When I train my model, the loss is finite at first, but after some time it becomes NaN and stays that way.
When I train my model on just a single batch of 10 images, the loss is finite most of the time, but sometimes it is also NaN.
Please suggest a possible solution.

Thanks in advance

tom · December 26, 2017, 8:34pm

Usually, the gradients become NaN first. The first two things to look at are a reduced learning rate and possibly gradient clipping.
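
A minimal sketch of both ideas, checking the gradients for NaN/Inf after `backward()` and clipping them before the optimizer step. The model, data, learning rate, and the helper name `nonfinite_grads` are placeholders chosen for illustration, not anything from the original code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the real model and data (hypothetical).
model = nn.Linear(4, 1)
# torch.optim.Adagrad can be swapped in here; clipping works the same way.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs, targets = torch.randn(10, 4), torch.randn(10, 1)


def nonfinite_grads(module):
    """Return names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in module.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]


for step in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()

    bad = nonfinite_grads(model)
    if bad:
        print(f"step {step}: non-finite gradients in {bad}")
        break

    # Rescale all gradients so their combined L2 norm is at most max_norm.
    # The return value is the norm measured before clipping.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Clipping is independent of the optimizer, since it only modifies `.grad` between `backward()` and `step()`. Logging `grad_norm` over a few hundred steps is one way to see whether the gradients blow up before the loss does.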

Best regards

Thomas

richard · December 27, 2017, 12:41am

@tom’s suggestions are great. Normalizing the data may help as well.

Vijay_Dubey · December 27, 2017, 9:58am

Hi @tom
Thanks for your reply.
Can you also suggest how I can check whether the gradients are becoming NaN first, and how I can apply gradient clipping with an SGD or AdaGrad optimizer?

Thanks

Vijay_Dubey · December 27, 2017, 8:59pm

I tried normalizing the data. But how should I decide the mean and standard deviation for the operation?

Thanks

richard · December 28, 2017, 1:38am

You could use a normalization layer. Alternatively, you can try dividing by some constant first (perhaps equal to the max value of your data?). The idea is to get the values low enough that they don’t cause really large gradients.
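
As a rough sketch of the two options, assuming image data stored as one tensor (the `train_images` tensor and its shape are made up for illustration): the mean and standard deviation are usually computed per channel over the training set only.

```python
import torch

torch.manual_seed(0)

# Hypothetical training images: 100 RGB images, 32x32, raw pixel range [0, 255].
train_images = torch.randint(0, 256, (100, 3, 32, 32)).float()

# Option 1: divide by a constant (here the maximum possible pixel value).
scaled = train_images / 255.0

# Option 2: standardize with per-channel statistics of the training set.
mean = scaled.mean(dim=(0, 2, 3), keepdim=True)  # shape (1, 3, 1, 1)
std = scaled.std(dim=(0, 2, 3), keepdim=True)
normalized = (scaled - mean) / std
```

The same `mean` and `std` would then be reused for validation and test data rather than recomputed.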

AliSh · September 15, 2019, 9:44am

Hi, I am facing the same issue. Have you managed to solve it?

ptrblck · September 15, 2019, 11:00pm

Have you tried the suggestions from this thread, and is none of them working?
If so, when do you see the first NaN value?
Could you additionally check your input for NaN values?
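
One possible way to run that input check is a small helper called on every batch before the forward pass. The helper name `assert_finite` and the example batch are hypothetical:

```python
import torch


def assert_finite(name, tensor):
    """Raise if the tensor contains NaN or Inf, reporting how many entries are bad."""
    bad = ~torch.isfinite(tensor)
    if bad.any():
        raise ValueError(f"{name}: {int(bad.sum())} non-finite value(s)")


# Hypothetical batch with one corrupted entry.
images = torch.ones(2, 3)
images[0, 1] = float("nan")

try:
    assert_finite("images", images)
    message = None
except ValueError as err:
    message = str(err)
    print(message)  # images: 1 non-finite value(s)
```

The same helper can be applied to the targets and to the model output to narrow down where the first bad value shows up.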

AliSh · September 17, 2019, 2:40pm

Thanks for the reply. None of the suggestions in this thread worked for me.
In the end, I solved my issue with your suggestion in another thread, Getting Nan after first iteration with custom loss.

t.ouyang · December 17, 2019, 1:49pm

Here is a way of debugging the NaN problem.
First, print your model gradients, because that is where the NaNs are likely to appear first.
Then check the loss, and then check the input of your loss. Just follow the clues and you will find the bug that causes the NaN problem.

Here is some useful information about why the NaN problem can happen:
1. the learning rate is too high
2. sqrt(0) appears somewhere in the computation
3. ReLU is used (try replacing it with LeakyReLU)
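
On the sqrt(0) item: the forward value is simply 0, and the problem is in the backward pass, because the derivative 1 / (2 * sqrt(x)) is infinite at x = 0. A small sketch of that behaviour (the epsilon value at the end is an arbitrary choice for illustration):

```python
import torch

# Forward: sqrt(0) is 0. Backward: d/dx sqrt(x) = 1 / (2 * sqrt(x)), which is inf at x = 0.
x = torch.tensor(0.0, requires_grad=True)
torch.sqrt(x).backward()
print(x.grad)  # tensor(inf)

# If the incoming gradient is 0, the backward pass computes 0 / 0, which is NaN.
y = torch.tensor(0.0, requires_grad=True)
(torch.sqrt(y) * 0.0).backward()
print(y.grad)  # tensor(nan)

# A common workaround is to add a small epsilon inside the square root.
z = torch.tensor(0.0, requires_grad=True)
torch.sqrt(z + 1e-8).backward()
print(torch.isfinite(z.grad))  # tensor(True)
```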

BikashgG · December 27, 2019, 8:16pm

Why is sqrt(0) a NaN? Should it not be equal to 0?

monster · June 25, 2020, 11:05am

I have 100 folders of images from different classes, and I am getting a NaN loss value for some folders. I have already checked for grayscale images, truncated images, missing labels, etc., and everything is fine, but I am still getting a ‘nan’ loss. What could be the possible reason?

ptrblck · June 26, 2020, 3:17am

Do you get a NaN output from your model if you are using samples from certain folders or what do you mean by:

If your model is returning NaNs, you could call torch.autograd.set_detect_anomaly(True) at the beginning of your script to get a stack trace, which should point to the operation that is creating the NaNs.

monster · June 27, 2020, 4:32am

I am getting a ‘nan’ loss after the 1st epoch on a large dataset. Please tell me all the possible reasons for a ‘nan’ loss value.
Check this:

dict_values([tensor(5.5172, device='cuda:0', grad_fn=<NllLossBackward>), tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), tensor(3.7665, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)])

ptrblck · June 29, 2020, 7:30am

NaN values can be created by invalid operations, such as torch.log(torch.tensor(-1.)), by operations executed on Infs (created through over-/underflows) etc.
To isolate it, use torch.autograd.set_detect_anomaly(True).
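
A small sketch of what anomaly detection reports, using a deliberately broken 0 / 0 division as a stand-in for the real loss (the tensor and the expression are made up for illustration):

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

# set_detect_anomaly can be called once as a function or used as a context manager.
with torch.autograd.set_detect_anomaly(True):
    loss = (x / x).sum()  # 0 / 0: the forward value is already NaN
    try:
        loss.backward()
        error = None
    except RuntimeError as err:
        error = str(err)
        print(error)  # names the backward function that returned NaN
```

The check runs during the backward pass, so it names the first backward function that returns NaN and prints the traceback of the matching forward call. It slows training down noticeably, so it is usually enabled only while debugging.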

monster · June 30, 2020, 4:44pm

Thanks for the reply.
Can I just use gradient clipping? If yes, how can I choose the clip value?

ptrblck · June 30, 2020, 9:31pm

If larger gradient magnitudes are expected and would thus create invalid values, you might clip the gradients. You could start with a max norm value of 1 or refer to a paper that uses a similar approach.

Note however, that FloatTensors have a maximal value of:

print(torch.finfo().max)
> 3.4028234663852886e+38

so you should make sure that the NaNs are not created by an invalid operation.
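
To illustrate the difference between the two routes, here is a short sketch with arbitrarily chosen values: an overflow first produces Inf and only later NaN, while an invalid operation produces NaN directly, even with small inputs.

```python
import torch

big = torch.tensor(3e38)            # close to the float32 maximum
overflowed = big * 10               # exceeds the representable range
print(overflowed)                   # tensor(inf)

after_overflow = overflowed - overflowed
print(after_overflow)               # tensor(nan), since inf - inf is undefined

invalid = torch.log(torch.tensor(-1.0))
print(invalid)                      # tensor(nan), no overflow involved
```

Gradient clipping can only help with the first route. An invalid operation stays invalid no matter how small the gradients are.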

monster · July 1, 2020, 11:50am

By using torch.autograd.set_detect_anomaly(True), I found this error. My dataset seems OK, so to resolve this issue, should I use gradient clipping or just ignore the ‘nan’ values using torch.isnan(x)?

RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output

ptrblck · July 1, 2020, 5:02pm

I would recommend trying to figure out what is causing the NaNs instead of ignoring them.
Based on the raised error, the loss function might either have created the NaNs or might have received them through its input.

To isolate it, you could try to make the script deterministic following the reproducibility docs. Once it’s deterministic and you can trigger the NaNs in a single step, you could check the parameters, inputs, gradients etc. for the iteration which causes the NaNs.
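
A minimal sketch of the seeding part, assuming only Python's `random` module and PyTorch are sources of randomness (the helper name `seed_everything` is hypothetical; NumPy or DataLoader workers would need their own seeding):

```python
import random

import torch


def seed_everything(seed: int) -> None:
    """Seed the Python and PyTorch RNGs and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(0)
first = torch.randn(3)
seed_everything(0)
second = torch.randn(3)
print(torch.equal(first, second))  # True
```

With a fixed seed, the iteration that first produces NaNs should be the same on every run, which makes it possible to stop right before it and inspect the tensors involved.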

malisit · July 30, 2020, 1:42pm

I don’t know if this applies to this case, and I made sure nothing’s wrong with my data, but I see NaNs after some time when I use RMSProp but not with Adam. Maybe try changing your optimizer? A similar experience has been shared for Keras as well.