RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output

Hi, when training a fairly deep transformer network, I randomly see this error (raised by torch.autograd.set_detect_anomaly(True)). After it occurs, my entire model becomes NaN.

So, I have tried all the suggestions I could find online:

  1. Checked all input and target data to make sure there is no inf or NaN
  2. Applied all normalizations properly
  3. Experimented with a very low learning rate
  4. Used torch.nn.utils.clip_grad_norm_
  5. Reduced the number of gradient accumulation iterations and also the batch size
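For reference, steps 1 and 4 look roughly like this in my training loop (a simplified sketch; the model, data, and optimizer names are placeholders):

```python
import torch

def assert_finite(t, name):
    # Step 1: fail fast if any input/target tensor contains NaN or inf
    assert not torch.isnan(t).any(), f"{name} contains NaN"
    assert not torch.isinf(t).any(), f"{name} contains inf"

# Inside the training loop (placeholders, not my real code):
#   assert_finite(inputs, "inputs")
#   assert_finite(targets, "targets")
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # step 4
#   optimizer.step()
```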

None of these has improved the situation, and I am wondering what the cause could be.

I have also printed the input and output of the log-softmax layer. At the step where everything breaks down and becomes NaN, none of the values look suspicious: there are no NaNs (torch.isnan()), no infs (torch.isinf()), and the min and max values seem reasonable (between -15 and +15).
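The diagnostic I am running at that step looks roughly like this (a sketch; `t` stands for the log-softmax input or output tensor):

```python
import torch

def inspect_tensor(t, name):
    # Print the stats I'm checking: NaN/inf presence and the value range
    print(f"{name}: nan={torch.isnan(t).any().item()}, "
          f"inf={torch.isinf(t).any().item()}, "
          f"min={t.min().item():.3f}, max={t.max().item():.3f}")
```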

So, what could be causing this? Could it be that some of the softmax outputs are so close to 0 that numerical instability makes the log blow up toward -inf?
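To illustrate what I mean: computing log(softmax(x)) in two separate steps can underflow to 0 and then produce -inf, while the fused torch.nn.functional.log_softmax uses the log-sum-exp trick and stays finite. (My observed values are only between -15 and +15, so this exact underflow may not apply, but it shows the failure mode I'm suspecting; the input values here are just illustrative.)

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0.0, 200.0])  # large gap: exp(-200) underflows to 0 in float32

naive = torch.log(torch.softmax(x, dim=0))  # log(0) -> -inf at index 0
fused = F.log_softmax(x, dim=0)             # log-sum-exp trick: finite [-200., 0.]
```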

Thank you so much for your suggestions.
