NaN loss appearing after some time

The loss function is a combination of mean squared error loss and cross-entropy loss.
When I train my model, the loss is finite at first, but after some time it becomes NaN and stays that way.
When I train my model on just a single batch of 10 images, the loss is finite most of the time, but sometimes it is NaN as well.
Please suggest a possible solution.

Thanks in advance


Usually, the gradients become NaN first. The first two things to look at are a reduced learning rate and possibly gradient clipping.

Best regards



@tom’s suggestions are great. Normalizing the data may help as well.


Hi @tom
Thanks for your reply.
Can you also suggest how I can check whether the gradients are becoming NaN first, and how I can apply gradient clipping with an SGD or AdaGrad optimizer?



I tried doing it. But how should I decide the mean and standard deviation for the operation?


You could use a normalization layer. Alternatively, you can try dividing by some constant first (perhaps equal to the max value of your data?) The idea is to get the values low enough that they don’t cause really large gradients.
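As a rough sketch of the two options above (dividing by a constant, or standardizing with a mean and standard deviation computed from the training data), something like the following could work. The random `images` tensor and the `[0, 255]` value range are assumptions made for illustration, not details from the original question.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of 10 RGB images with raw pixel values in [0, 255].
images = torch.randint(0, 256, (10, 3, 32, 32)).float()

# Option 1: divide by a constant, here the largest possible pixel value.
scaled = images / 255.0

# Option 2: standardize each channel with statistics computed from the data.
# In practice these would be computed once over the training set and reused.
mean = images.mean(dim=(0, 2, 3), keepdim=True)
std = images.std(dim=(0, 2, 3), keepdim=True)
standardized = (images - mean) / (std + 1e-8)  # small eps guards against std == 0
```

Either way the inputs end up in a small range, which tends to keep the activations and gradients smaller too.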


Hi, I am facing the same issue. Have you managed to solve it?

Have you tried the suggestions from this thread, and is none of them working?
If so, when do you see the first NaN value?
Could you additionally check your input for NaN values?

Thanks for the reply. None of the suggestions in this thread worked for me.
In the end, I solved my issue with your suggestion from another thread, "Getting Nan after first iteration with custom loss".


Here is a way of debugging the NaN problem.
First, print your model gradients, because the NaNs are likely to show up there first.
Then check the loss, and then check the inputs to your loss. Just follow the clues and you will find the bug that causes the NaNs.
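A minimal sketch of that checking order might look like the following. The helper `report_non_finite` is a hypothetical name, and the tiny `nn.Linear` model with a NaN deliberately placed in the input is only there to make the example self-contained.

```python
import torch
from torch import nn


def report_non_finite(model, inputs, loss):
    """Return labels for every place where a NaN or Inf value is found."""
    problems = []
    if not torch.isfinite(inputs).all():
        problems.append("input")
    if not torch.isfinite(loss).all():
        problems.append("loss")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            problems.append(f"grad:{name}")
    return problems


model = nn.Linear(3, 1)
x = torch.tensor([[1.0, 2.0, float("nan")]])  # one bad input value on purpose
loss = model(x).pow(2).mean()
loss.backward()

problems = report_non_finite(model, x, loss)
print(problems)
```

Calling such a helper after `loss.backward()` in every iteration shows the first step at which something goes wrong, and whether the input, the loss, or only the gradients are affected.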

Here is some useful information about why the NaN problem can happen:
1. The learning rate


Why is sqrt(0) NaN? Shouldn't it be equal to 0?
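One likely explanation, offered as a sketch rather than something confirmed in this thread: the forward value of sqrt(0) is indeed 0, but its derivative 1/(2*sqrt(x)) is infinite at 0, and an infinite gradient multiplied by a zero upstream gradient gives NaN in the backward pass. The epsilon value `1e-8` below is an arbitrary choice for illustration.

```python
import torch

# Forward pass is fine: sqrt(0) == 0. The gradient, however, is inf.
x = torch.tensor(0.0, requires_grad=True)
y = torch.sqrt(x)
y.backward()
grad_plain = x.grad.clone()

# If the upstream gradient is 0, the backward pass computes 0 / 0, which is NaN.
x2 = torch.tensor(0.0, requires_grad=True)
z = torch.sqrt(x2) * 0.0
z.backward()
grad_zero_upstream = x2.grad.clone()

# A common workaround is to add a small epsilon inside the square root.
x3 = torch.tensor(0.0, requires_grad=True)
torch.sqrt(x3 + 1e-8).backward()
grad_with_eps = x3.grad.clone()
```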


I have 100 folders of images, one per class, and I am getting a NaN loss value for some folders. I have already checked for grayscale images, truncated files, missing labels etc., and everything is fine, but I still get a NaN loss. What could be the possible reason?

Do you get a NaN output from your model if you are using samples from certain folders or what do you mean by:

If your model is returning NaNs, you could call torch.autograd.set_detect_anomaly(True) at the beginning of your script to get a stack trace, which would hopefully point to the operation that is creating the NaNs.

I am getting a NaN loss after the first epoch on a large dataset. Please tell me all the possible reasons for a NaN loss value.
Check this:

dict_values([tensor(5.5172, device='cuda:0', grad_fn=<NllLossBackward>), tensor(nan, device='cuda:0', grad_fn=<DivBackward0>), tensor(3.7665, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>), tensor(inf, device='cuda:0', grad_fn=<DivBackward0>)])

NaN values can be created by invalid operations, such as torch.log(torch.tensor(-1.)), by operations executed on Infs (created through over-/underflows) etc.
To isolate it, use torch.autograd.set_detect_anomaly(True).
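As a small illustration of what anomaly detection reports, the sketch below deliberately builds a graph whose backward pass produces a NaN (0 / 0 in the gradient of sqrt). The toy computation is an assumption chosen only to trigger the error; in a real script the call would simply go at the top.

```python
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.tensor(0.0, requires_grad=True)
y = torch.sqrt(x) * 0.0  # backward of sqrt computes 0 / 0 here

try:
    y.backward()
    message = None
except RuntimeError as err:
    # The message names the backward function that returned NaN values.
    message = str(err)
finally:
    # Anomaly detection slows training down, so switch it off after debugging.
    torch.autograd.set_detect_anomaly(False)

print(message)
```

Along with the error, PyTorch prints the traceback of the forward operation that created the failing backward node, which is usually the quickest pointer to the offending line.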


Thanks for the reply.
Can I just use gradient clipping? If yes, how can I choose the clip value?

If larger gradient magnitudes are expected and would thus create invalid values, you might clip the gradients. You could start with a max norm value of 1 or refer to any paper, which uses a similar approach.
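A minimal sketch of clipping with a max norm of 1 and a plain SGD optimizer, assuming a toy linear model and deliberately large random inputs. The same `clip_grad_norm_` call works unchanged with other optimizers such as Adagrad, since it only touches the `.grad` attributes of the parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data, purely illustrative.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.randn(10, 4) * 1e3  # deliberately large inputs, so gradients are large
y = torch.randn(10, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Clip after backward() and before step(). The return value is the total
# gradient norm measured before clipping, which is worth logging.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Total gradient norm after clipping, recomputed for comparison.
clipped_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

optimizer.step()
```

Logging `total_norm` over a few hundred iterations gives a feel for typical gradient magnitudes, which can help in choosing a clip value other than 1.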

Note however, that FloatTensors have a maximal value of:

> 3.4028234663852886e+38

so you should make sure that the NaNs are not created by an invalid operation.
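For reference, the limit above can be queried with `torch.finfo`, and the short sketch below shows how an overflow turns into Inf and then into NaN, next to a NaN that comes straight from an invalid operation. The specific arithmetic is only an illustration.

```python
import torch

# Largest finite float32 value.
fmax = torch.finfo(torch.float32).max

# Overflow: going past the maximum yields inf, not nan.
overflow = torch.tensor(fmax) * 2

# Operating on infs can then produce nan, e.g. inf - inf.
nan_from_inf = overflow - overflow

# An invalid operation produces nan directly, no overflow involved.
invalid = torch.log(torch.tensor(-1.0))
```

Gradient clipping can help with the overflow case, but it does nothing for the invalid-operation case, which has to be fixed at its source.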


By using torch.autograd.set_detect_anomaly(True) I found this error. My dataset seems OK, so to resolve this issue should I use gradient clipping, or just ignore the NaN values using torch.isnan(x)?

RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output

I would recommend trying to figure out what is causing the NaNs instead of ignoring them.
Based on the raised error, the loss function might either have created the NaNs itself or received them through its input.

To isolate it, you could try to make the script deterministic by following the reproducibility docs. Once it’s deterministic and you can trigger the NaNs in a single step, you can check the parameters, inputs, gradients etc. for the iteration that causes the NaNs.
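A rough sketch of the seeding part, assuming a CPU-only toy step. The helper names `seed_everything` and `one_step` are hypothetical, and a real training script may need further settings from the reproducibility docs (data loader workers, `torch.use_deterministic_algorithms`, and so on).

```python
import random

import torch
from torch import nn


def seed_everything(seed: int) -> None:
    """Seed the Python and PyTorch RNGs and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def one_step(seed: int):
    """Run one forward/backward pass of a toy model from a fixed seed."""
    seed_everything(seed)
    model = nn.Linear(4, 2)
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    loss.backward()
    return loss.item(), model.weight.grad.clone()


loss_a, grad_a = one_step(123)
loss_b, grad_b = one_step(123)
```

If two runs with the same seed match like this, the iteration that first produces a NaN can be replayed as often as needed while inspecting inputs, parameters, and gradients.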


I don’t know if this applies to this case and I made sure nothing’s wrong with my data, but I see nans after some time when I use RMSProp but not with Adam. Try changing your optimizer maybe? A similar experience is shared for Keras as well.