LayerNorm's grads become NaN after first epoch

  1. Your NaNs emerge when computing the gradient of your loss w.r.t. your parameters, so you won’t see them in your input, only once gradients are computed. If your loss is Inf, the gradients of that loss w.r.t. the parameters will be NaN (there’s a minimal sketch of this after the list).

  2. Clamping the output to stop it overflowing could help, but a simpler solution would be to ask whether you really need to be running your code in torch.float16 (see the second sketch below).

  3. The hook prints the gradient that is used during the optimizer step, so I assume it’s the scaled gradient (as that’s what AMP uses during backprop). The third sketch below shows how to look at the unscaled gradients instead.
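
A minimal, self-contained sketch of point 1 (not from the original thread; the values are illustrative). It uses float32 so it runs anywhere, but the same thing happens much earlier in float16, whose largest finite value is only 65504:

```python
import torch

# Once an overflow makes the loss Inf, backprop through that overflow
# poisons the parameter gradients with NaN, even though the inputs and
# parameters themselves are perfectly finite.
w = torch.tensor([90.0, 1.0], requires_grad=True)

loss = torch.log(torch.exp(w).sum())  # exp(90) overflows float32 -> loss is Inf
loss.backward()

print(loss)    # tensor(inf, ...)
print(w.grad)  # tensor([nan, 0.]) -- the NaN shows up only in the gradient
```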
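For point 2, a hedged sketch of two alternatives to raw float16 (the model and names here are illustrative, not from the original post): clamping activations to float16’s finite range as a stop-gap, or running autocast in bfloat16, which keeps float32’s exponent range and so doesn’t overflow at 65504:

```python
import torch

# (a) Stop-gap: clamp activations to float16's finite range before they overflow.
fp16_max = torch.finfo(torch.float16).max  # 65504.0

def clamp_to_fp16_range(x: torch.Tensor) -> torch.Tensor:
    return x.clamp(min=-fp16_max, max=fp16_max)

# (b) Prefer bfloat16 autocast if your hardware supports it; use
#     device_type="cuda" on a GPU.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.LayerNorm(16))
x = torch.randn(8, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(torch.isfinite(out).all())  # activations stay finite
```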
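And for point 3, a sketch of the standard GradScaler loop (assumes a CUDA device; all tensor names are illustrative). A gradient hook fires while backprop runs on `scaler.scale(loss)`, so it sees the scaled gradient; calling `scaler.unscale_(optimizer)` before the step exposes the true values:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

# This hook fires during backward and therefore prints the *scaled* gradient.
model.weight.register_hook(lambda g: print("hook (scaled):", g.norm().item()))

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # hook fires here with the scaled gradient
scaler.unscale_(optimizer)      # model.weight.grad now holds unscaled values
print("unscaled:", model.weight.grad.norm().item())
scaler.step(optimizer)          # skipped automatically if grads contain Inf/NaN
scaler.update()
optimizer.zero_grad()
```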