Hi,
I’m training Mask R-CNN (box head only) on my own dataset.
At the beginning, the loss kept dropping in general, until it suddenly became ‘nan’.
So I tried autograd.detect_anomaly, and here is what I got:
RuntimeError Traceback (most recent call last)
<ipython-input-17-f414452ff0e1> in <module>()
37 loss_train.append(smooth_tra.avg)
38 optimizer.zero_grad()
---> 39 losses.backward()
40 optimizer.step()
41 #lr_scheduler.step()
1 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
196 products. Defaults to ``False``.
197 """
--> 198 torch.autograd.backward(self, gradient, retain_graph, create_graph)
199
200 def register_hook(self, hook):
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
98 Variable._execution_engine.run_backward(
99 tensors, grad_tensors, retain_graph, create_graph,
--> 100 allow_unreachable=True) # allow_unreachable flag
101
102
RuntimeError: Function 'SmoothL1LossBackward' returned nan values in its 0th output.
What does this traceback mean? What could be causing this problem?
Thanks for any advice!
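For reference, this is roughly how I enabled anomaly detection (a minimal sketch with a toy computation, not my actual model):

```python
import torch

# Toy computation standing in for the real model/loss.
x = torch.zeros(1, requires_grad=True)
caught = False
try:
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x)        # d(sqrt)/dx at x=0 is inf
        loss = (y * 0.0).sum()   # backward computes 0 * inf = nan
        loss.backward()
except RuntimeError:
    caught = True                # anomaly mode raises, naming the offending Function
print(caught)
```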
It means that the first nan to appear in this backward pass was in the backward of the Smooth L1 loss that you’re using.
Since this is the last Function in the forward pass, maybe the loss value is already nan? Or the loss is evaluated at a point where the gradient leads to nan.
Sorry for the late reply, but clip_grad_norm_ doesn’t help. After a random number of iterations, the loss always becomes nan.
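For context, this is roughly how I applied the clipping (a sketch with a hypothetical toy model, not my actual Mask R-CNN setup):

```python
import torch
from torch import nn

# Hypothetical toy model standing in for the real box head.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

loss = nn.functional.smooth_l1_loss(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
# Cap the total gradient norm before the parameter update.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Note that clipping only rescales finite gradients; once a gradient is already nan, clipping leaves it nan.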
Could it matter that the data augmentation is too aggressive?
No, it didn’t diverge; it just kept dropping as normal. As for converging towards a specific value, I don’t know: during my training, nan could appear at any loss value in a range from about 0.10 to 0.40. The best loss I’ve got so far is about 0.13.
This is quite surprising.
But if the loss doesn’t diverge, you want to add prints in your forward/evaluation code to see exactly where the first nan appears, e.g. because of a division by 0 or something similar, and then make sure it cannot happen anymore (for example by adding an eps to the division).
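For example, a sketch of that kind of check (the helper names here are hypothetical):

```python
import torch

def check_nan(name, t):
    # Print-style probe to locate the first nan in the forward computation.
    if torch.isnan(t).any():
        print(f"nan found in {name}")

def safe_div(num, den, eps=1e-8):
    # An eps in the denominator keeps 0/0 from producing nan.
    return num / (den + eps)

num = torch.tensor([0.0, 2.0])
den = torch.tensor([0.0, 4.0])

bad = num / den            # 0/0 -> nan
good = safe_div(num, den)  # 0/(0+eps) -> 0

check_nan("bad", bad)      # prints: nan found in bad
check_nan("good", good)    # prints nothing
```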
It does, but only during the backward pass.
So if the first nan appears in the forward pass, detect_anomaly will only catch it at the beginning of the backward pass (in the loss function here).
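One cheap way around that limitation is to check the loss yourself before calling backward, e.g. (a sketch on a toy expression):

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
loss = (x * torch.log(x)).sum()  # 0 * log(0) = 0 * -inf -> nan already in the forward

# Checking before backward surfaces a forward-pass nan immediately,
# instead of it being reported later as a nan gradient by anomaly mode.
if not torch.isfinite(loss):
    print("non-finite loss before backward; skip this batch")
```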