I would love to use automatic mixed precision more extensively in my training, but I find it too unstable: runs often end in NaNs. Are there any general tricks people here have used to improve stability?
I’ve seen the following general tips:
- plot the gradients and force unstable layers to fp32 (rough sketches of both after this list)
- bump weight decay in the optimizer
- bump epsilon in the optimizer (e.g. Adam's eps; one-liner in the loop sketch below)
- try an exotic optimizer
- add/try different normalization layers
- force loss calculations to fp32 (also sketched below)
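
For the gradient-plotting tip, here is the kind of check I have in mind: log per-layer gradient norms after backward so the unstable layers stand out. A minimal sketch assuming a standard PyTorch setup; `log_suspicious_grads` and the threshold are made-up names/values of mine:

```python
import math

import torch


def log_suspicious_grads(model: torch.nn.Module, step: int, threshold: float = 1e3) -> None:
    # Call after backward() -- and after scaler.unscale_(optimizer) if you use
    # GradScaler, otherwise the norms still include the loss-scale factor.
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        norm = param.grad.detach().float().norm().item()
        if not math.isfinite(norm) or norm > threshold:
            print(f"step {step}: {name} grad norm = {norm:.3e}")
```

Layers that show up here repeatedly would be the candidates to pin to fp32.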
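For forcing unstable layers (and the loss) to fp32, my understanding is that you can exit autocast locally and upcast the inputs. Sketch only; `StableHead` is hypothetical, and I believe autocast already runs some reductions like softmax in fp32, so this is belt-and-braces:

```python
import torch
from torch import nn


class StableHead(nn.Module):
    """Pins one suspect layer to fp32 even inside an autocast region."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Locally disable autocast and upcast the input to fp32.
        with torch.autocast(device_type="cuda", enabled=False):
            return self.proj(x.float())


def fp32_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Same idea for the loss: upcast logits before the softmax/cross-entropy.
    return nn.functional.cross_entropy(logits.float(), targets)
```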
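And for the epsilon tip: Adam-style optimizers divide the update by sqrt(v) + eps, so bumping eps above the 1e-8 default keeps the denominator from vanishing when the second-moment estimate gets tiny. For reference, a minimal AMP loop with that change, assuming CUDA/fp16 and a toy model (GradScaler should already skip any step whose gradients contain inf/NaN):

```python
import torch
from torch import nn

device = "cuda"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
# eps bumped from the 1e-8 default; 1e-6 is an arbitrary example value.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, eps=1e-6)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):
    inputs = torch.randn(32, 128, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(inputs), targets)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                         # gradients back in real units
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)                             # skipped if grads have inf/NaN
    scaler.update()
```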