I have a loss function that starts around 5.0 at the beginning of training and goes down to about 0.0003 during training. Then I get this error:
```
RuntimeError                              Traceback (most recent call last)
\lib\site-packages\torch\autograd\__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: Function 'MseLossBackward0' returned nan values in its 0th output.
...
    170 if isinstance(error, str):
    171     error = AssertionError(error)
--> 172 raise error

AssertionError:
```
Neither the loss value nor the predictions nor the labels contain NaN values. That's why I assume the loss has become too small and produces NaNs during the backward pass.
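For context, this is roughly how I check for NaNs and how the traceback above was produced (the tensor shapes here are placeholders, not my real model; with the random data in this sketch the backward pass runs fine, the failure only happens with my real training data):

```python
import torch

# Placeholder tensors standing in for my model output and targets
pred = torch.randn(8, 1, requires_grad=True)
label = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(pred, label)

# None of these report NaN in my runs, which is what confuses me
print(torch.isnan(pred).any().item())   # False
print(torch.isnan(label).any().item())  # False
print(torch.isnan(loss).any().item())   # False

# Anomaly detection is what turns the NaN in the backward pass
# into the 'MseLossBackward0 returned nan values' traceback above
with torch.autograd.detect_anomaly():
    loss.backward()
```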
Is that true, and at what point does the loss become too small?
Should I multiply the loss by some factor if I want to train further?
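To make the question concrete, this is the kind of scaling I mean (the factor 1000 is an arbitrary placeholder, and the near-identical tensors just force a very small loss like the 0.0003 I see in training):

```python
import torch

pred = torch.randn(8, 1, requires_grad=True)
# Labels very close to the predictions, so the MSE loss is tiny
label = pred.detach() + 1e-4 * torch.randn(8, 1)

scale = 1000.0  # arbitrary factor, just illustrating the idea
loss = torch.nn.functional.mse_loss(pred, label)

# Backpropagate the scaled loss instead of the raw one
(scale * loss).backward()

# The gradients are scaled by the same factor, so I assume I'd
# have to compensate by lowering the learning rate accordingly
print(pred.grad.abs().max().item())
```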