scaler.get_scale() becomes zero and loss is still NaN under autocast

I have been trying to use autocast. My loss is NaN, so scaler.get_scale() keeps halving after every batch, from 32768 down until it reaches zero, at which point I stop the training. The traceback shows “RuntimeError: Function ‘PowBackward0’ returned nan values in its 0th output.” I have tried increasing the epsilon value in the line of code that raises this error, and I have also tried increasing the epsilon of the Adam optimizer I’m using to 1e-4.
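
For reference, a simplified stand-in for my setup looks roughly like this (the real model and data are omitted; the dummy model and loader here just show where everything sits in the loop):

```python
import torch
import torch.nn as nn

# Placeholder model/data -- my actual code is more complex, but the AMP
# plumbing (autocast + GradScaler) is the same.
model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), eps=1e-4)
scaler = torch.cuda.amp.GradScaler()
loader = [(torch.randn(4, 10).cuda(), torch.randn(4, 1).cuda()) for _ in range(3)]

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)  # in my real run this is NaN
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(scaler.get_scale())  # in my run: 32768 -> 16384 -> ... -> 0
```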

If your model outputs NaN values, the training is generally broken, and I don’t know if the model can somehow recover from it. Additionally, the scaler will keep reducing the loss scaling value, because the backward pass produces NaN gradients and the scaler therefore assumes the scaling factor is too high.
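
You can see this backoff behavior in isolation with a tiny standalone script (just an illustration, not your code; it assumes a CUDA device):

```python
import torch

# GradScaler skips the optimizer step and multiplies the scale by
# backoff_factor (0.5 by default) whenever it finds inf/NaN gradients,
# so a persistently NaN loss drives the scale from 32768 toward zero.
scaler = torch.cuda.amp.GradScaler(init_scale=32768.0)
param = torch.zeros(1, device="cuda", requires_grad=True)
optimizer = torch.optim.Adam([param], eps=1e-4)

for step in range(5):
    loss = param.sum() * float("nan")  # force a NaN loss -> NaN gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # skipped: non-finite grads detected
    scaler.update()         # scale *= 0.5
    optimizer.zero_grad()
    print(step, scaler.get_scale())  # 16384.0, 8192.0, 4096.0, ...
```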

I would recommend removing automatic mixed precision for now and fixing the NaN issue first.
Based on the error message, you might be passing negative values to a pow call with a fractional exponent (i.e. computing a root), which would produce NaNs in both the forward and backward pass.
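
Here is a small repro of that failure mode and one common workaround (hypothetical values, not from your code; the clamp epsilon is just an example):

```python
import torch

# A fractional exponent on a negative base is the root of a negative
# number, which is NaN in real arithmetic -- forward and backward.
x = torch.tensor([-0.5, 0.25], requires_grad=True)
y = x.pow(0.5)
y.sum().backward()
print(y)       # tensor([   nan, 0.5000], ...)
print(x.grad)  # tensor([nan, 1.]) -- PowBackward0 returned NaN

# One common fix: clamp the base to a small positive floor before pow,
# so both passes stay finite (1e-6 is just an example eps).
x2 = torch.tensor([-0.5, 0.25], requires_grad=True)
y2 = x2.clamp(min=1e-6).pow(0.5)
y2.sum().backward()
print(x2.grad)  # tensor([0., 1.]) -- finite everywhere
```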