- Your NaNs are emerging when calculating the gradient of your loss w.r.t. your parameters, so you won't see them in your input; you'll only see them when computing gradients. If your loss is Inf, the gradients of that loss w.r.t. the parameters will be NaN.
- Clamping the output to stop it overflowing could help, but a simpler solution would be to ask whether you really need to be running your code at torch.float16.
- The hook prints the gradient that is used during the optimizer step, so I assume it's the scaled gradient (as that's what AMP uses during backprop).
- For 4/5, I'd recommend reading the docs for its usage in AMP and how it's used during the backward pass. → PyTorch Lightning — PyTorch Lightning 1.5.0dev documentation
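To make the first two points concrete, here is a minimal sketch (toy tensors, not your model) of how float16 overflows at values way below float32's range, and how an Inf loss can turn into a NaN gradient via a `0 * inf` in the chain rule. The `register_hook` call shows the pattern for catching the bad gradient the moment it is produced:

```python
import torch

# 1) float16 overflows early: the largest representable value is ~65504.
a = torch.tensor([300.0], dtype=torch.float16)
print((a * a).item())  # 300 * 300 = 90000 > 65504, so this prints inf

# 2) An Inf in the forward pass can become a NaN gradient.
#    log(exp(z)) overflows to inf when z is large; its backward pass
#    computes (1 / inf) * inf = 0 * inf = nan.
w = torch.tensor([100.0], requires_grad=True)  # float32 here; fp16 fails sooner
z = w * 1e3
loss = torch.log(torch.exp(z))  # exp(1e5) -> inf, log(inf) -> inf
print("loss:", loss.item())     # inf

# A hook on the parameter sees the gradient as soon as it is computed.
w.register_hook(lambda g: print("grad seen by hook:", g))
loss.backward()
print("w.grad:", w.grad)        # tensor([nan])
```

Checking `torch.isfinite(loss)` before calling `backward()` (or clamping the output as mentioned above) lets you skip or diagnose the step before the NaN gradients ever reach the optimizer.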