How to avoid nan loss when using fp16 training?

Hi, I am using roberta-base to train on the RTE dataset. When I call torch.half() on my model's parameters, I find that after the first backward pass the loss becomes NaN.
Is there any way to solve this?

FP16 has a limited range of roughly ±65,504, so you should either use the automatic mixed-precision utilities via torch.cuda.amp (which will use FP16 where it's considered safe and FP32 where needed), or, if you want to stick to a manual approach, transform the data and parameters back to FP32 yourself for numerically sensitive operations.