LayerNorm's grads become NaN after first epoch

  1. Generally speaking, is it normal to end up with inf/NaN problems in a network even when the inputs do not contain them, or does that always indicate a bug somewhere in our code?

  2. If it is normal, how are we expected to handle it? Are we expected to add clipping or clamping of values somewhere? (A sketch of the kind of clipping I mean is below, after this list.)

  3. The hook you provided prints out the gradients, but not the scaled and unscaled values of each layer (the loss, in this case). Is there another hook I can add to do that? (See the hook sketch below for the kind of thing I am imagining.)

  4. The unscaled loss is around 100, which seems quite reasonable/low. How do I find out why it is being scaled up to such a large value in the first place? (See the scale snippet below for how I am currently thinking about this.)

  5. I’m not sure whether PyTorch Lightning uses AMP in the backward pass. How do I find out? (The last sketch below shows the checks I have in mind.)
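
To make question 2 concrete, this is the kind of clipping I mean. It is only a sketch: the threshold of 1.0 is a placeholder rather than a tuned value, and I am assuming Lightning's `gradient_clip_val` argument is the intended way to do this.

```python
from pytorch_lightning import Trainer

# Option A: let Lightning clip gradient norms (the 1.0 threshold is a placeholder).
trainer = Trainer(gradient_clip_val=1.0, gradient_clip_algorithm="norm")

# Option B: clip manually in plain PyTorch. With AMP, I assume this would need
# to happen after scaler.unscale_(optimizer), otherwise the threshold is
# compared against scaled gradients:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```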
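For question 3, this is roughly the hook I am imagining, written as a self-contained toy (the tiny model is just a stand-in for mine). My understanding is that a hook registered on the loss tensor receives the gradient flowing into it during `backward()`, and that with a `GradScaler` this gradient equals the current scale factor, so it would show the unscaled loss and the scale in one place. Please correct me if that is wrong.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny stand-in for my real model, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.LayerNorm(4)).to(device)
x = torch.randn(2, 4, device=device)
loss = model(x).pow(2).mean()

def inspect_loss(grad):
    # grad is d(scaled_loss)/d(loss); with a GradScaler this should be the scale.
    print(f"unscaled loss = {loss.item():.3e}, grad into loss = {grad.item():.3e}")

loss.register_hook(inspect_loss)

scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
scaler.scale(loss).backward()
```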
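For question 4, my current understanding (again, please correct me) is that the scaled loss is simply `unscaled_loss * scale`, and that `torch.cuda.amp.GradScaler` starts from a default scale of 65536 (2**16), which would already turn a loss of ~100 into ~6.5e6 before any growth. The standalone snippet below is how I have been checking the scale; I am assuming Lightning's own scaler can be reached through its precision plugin, but I am not sure of the exact attribute, which is partly what I am asking.

```python
import torch

scaler = torch.cuda.amp.GradScaler()
print("current scale:", scaler.get_scale())               # 65536.0 by default when scaling is enabled
print("loss of 100 after scaling:", 100.0 * scaler.get_scale())
```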
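For question 5, these are the checks I have in mind so far; the module is just a stub and `MyModule` is a placeholder name. I am assuming `torch.is_autocast_enabled()` reports whether autocast is active during the forward pass, and that `on_after_backward` runs right after `backward()`, where the gradients would still be scaled if a `GradScaler` is in use.

```python
import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # If Lightning set up AMP, autocast should be enabled here.
        print("autocast enabled in training_step:", torch.is_autocast_enabled())
        ...

    def on_after_backward(self):
        # Called right after loss.backward(); with a GradScaler the gradients at
        # this point should still be scaled, so inspecting them here would show
        # whether the huge values come from the scale factor or from the model.
        print("trainer precision setting:", self.trainer.precision)
```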

Thanks.