LayerNorm's grads become NaN after first epoch

First part yes, second part no. The Infs and NaNs are the result of a bug somewhere in your code; they're just not caused by the inputs themselves.

Hmm, that sounds odd given the max value for float32 is around 10^38. Check that no Infs are being created as inputs. You can use a similar method with register_forward_pre_hook (see the docs). Perhaps there's an issue with your LSTM module (whose output gets passed to the final Linear layer, which is where the Infs show up)?
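A minimal sketch of that check (assuming model is your existing LSTM + Linear network): a forward pre-hook on every submodule reports any non-finite tensor arriving as an input before the module runs.

import torch

def check_inputs(module, inputs):
    # inputs is the tuple of positional arguments the module is about to receive
    for i, inp in enumerate(inputs):
        if torch.is_tensor(inp) and not torch.isfinite(inp).all():
            print(f"Non-finite input {i} to {module.__class__.__name__}")

hooks = [m.register_forward_pre_hook(check_inputs) for m in model.modules()]

# ... run a forward pass, then remove the hooks once you're done debugging ...
for h in hooks:
    h.remove()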

Reading through the PyTorch docs on it shows you should use the unscaled gradients; I don't know how it's done within PyTorch Lightning.
Each parameter’s gradient (.grad attribute) should be unscaled before the optimizer updates the parameters, so the scale factor does not interfere with the learning rate.
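In plain PyTorch (not Lightning) that looks roughly like the sketch below; model, optimizer, scaler (a torch.cuda.amp.GradScaler), and loss are assumed to come from your existing training loop.

scaler.scale(loss).backward()

# Bring .grad back to its true magnitude before inspecting or clipping it,
# so the scale factor doesn't distort the values you look at.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)  # skips the update if Infs/NaNs are found in the grads
scaler.update()
optimizer.zero_grad()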

Looks good to me

The hooks store Tensors for all input samples. If you want the gradients and weights, you can print them out with the snippet below (although the gradients are accumulated over all samples, if any sample produces a NaN the accumulated value will be NaN too, so it's still a simple check).

for name, param in model.named_parameters():
    # a NaN anywhere in param or param.grad will show up in the printout
    print(name, param, param.grad)
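If printing full tensors is too noisy, a variant of the same check (a sketch, assuming torch is imported) that only flags the offending parameters:

for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"Non-finite gradient in {name}")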

I don't know how this works in PyTorch Lightning, but see if it's at all possible to change the scale constant that multiplies the loss. Also see if you can reproduce the issue in plain PyTorch (rather than Lightning), doing AMP in the forward pass only!
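A sketch of that repro (assuming model, optimizer, and the (x, y) batches from loader come from your existing setup, and the MSE loss is just a stand-in): autocast wraps only the forward pass, and GradScaler's init_scale is the constant that multiplies the loss, so you can lower it here.

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=2.**16)  # lower this to change the loss-scaling constant

for x, y in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # AMP in the forward pass only
        out = model(x)
        loss = torch.nn.functional.mse_loss(out, y)
    scaler.scale(loss).backward()    # backward runs outside autocast
    scaler.step(optimizer)
    scaler.update()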