RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output

I’m training on my own dataset with Mask R-CNN (box head only).
At the beginning, the loss value kept dropping in general until it suddenly became nan.
So I reran with autograd.detect_anomaly enabled, and here is what I got:

RuntimeError                              Traceback (most recent call last)
<ipython-input-17-f414452ff0e1> in <module>()
     37               loss_train.append(smooth_tra.avg)
     38           optimizer.zero_grad()
---> 39           losses.backward()
     40           optimizer.step()
     41           #lr_scheduler.step()

1 frames
/usr/local/lib/python3.6/dist-packages/torch/ in backward(self, gradient, retain_graph, create_graph)
    196                 products. Defaults to ``False``.
    197         """
--> 198         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    200     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/ in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     98     Variable._execution_engine.run_backward(
     99         tensors, grad_tensors, retain_graph, create_graph,
--> 100         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output.

What does this traceback mean? What could cause this problem?
Thanks for any advice!


It means that the first nan to appear in this backward pass was in the backward of the Smooth L1 loss that you’re using.
Since this was the last Function in the backward pass, maybe the loss value is already nan? Or the loss is evaluated at a point where the gradient leads to nan.
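
A quick way to confirm the first possibility is to check the loss for nan/inf right before calling `backward()`. A minimal sketch (the function name and training-step shape here are my own, not from the original code):

```python
import torch

def backward_if_finite(losses, optimizer):
    # Skip the update when the summed loss is already non-finite:
    # this distinguishes a nan created in the forward pass from one
    # produced only during backward.
    if not torch.isfinite(losses):
        print("non-finite loss detected:", losses.item())
        return False
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    return True
```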

Yes, the loss was nan, and I’ll try clipping the gradients.

Sorry, my response is a little late, but clip_grad_norm_ doesn’t work. After a random number of iterations, the loss always becomes nan.
Could too much data augmentation be the cause?
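
For reference, this is the usual placement of `clip_grad_norm_` in a training step (a minimal sketch with a stand-in linear model, not the actual Mask R-CNN setup). Note that clipping only rescales gradients after `backward()`, so it cannot rescue a loss that is already nan:

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the real box head
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

optimizer.zero_grad()
loss = torch.nn.functional.smooth_l1_loss(model(x), target)
loss.backward()
# Clipping goes after backward() and before step(): it bounds the
# gradient norm, but does nothing about a nan already in the loss.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```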


Does the loss diverge before getting to nan? Or does it actually converge towards a very specific value before becoming nan?

No, it didn’t diverge; it just kept dropping like normal. As for converging towards a specific value, I don’t know: during my training, nan could appear anywhere in a range of loss values, from about 0.10 to 0.40. The best loss I’ve gotten so far is about 0.13.

This is quite surprising.
But if the loss doesn’t diverge, you want to add prints in your evaluation code to see exactly where the first nan appears, e.g. because of a division by 0 or something similar, and then make sure it doesn’t happen anymore (for instance by adding an eps to the division).
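
As an illustration of both suggestions, a sketch (the helper names here are made up for the example):

```python
import torch

def safe_div(num, den, eps=1e-8):
    # A division by zero in the forward pass silently produces inf/nan;
    # a small eps keeps the denominator away from zero (assuming den >= 0).
    return num / (den + eps)

def report_first_nonfinite(name, t):
    # Print-style probe: sprinkle these through the forward/evaluation
    # code to locate exactly where the first nan or inf appears.
    if not torch.isfinite(t).all():
        print(f"first non-finite value appeared in: {name}")
        return True
    return False
```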

Thanks for your advice. I’ll do it.
(BTW, shouldn’t detect_anomaly catch the very first nan?)

It does, but only in the backward pass.
So if the first nan appears in the forward pass, it will only be caught at the beginning of the backward pass (in the loss function here).
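
A small demonstration of that behaviour (`torch.sqrt` of a negative number is just a stand-in for whatever produces the nan in your forward pass):

```python
import torch

x = torch.tensor([-1.0], requires_grad=True)
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x)  # nan is created here, in the forward pass, silently
    try:
        y.backward()   # detect_anomaly only reports it once backward runs
    except RuntimeError as err:
        print("caught during backward:", err)
```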

Oh~ I see. Thank you very much!