I’m using the Hugging Face microsoft/mdeberta-v3-base pretrained model to fine-tune on my task. I discovered that during training with AMP, the optimizer never gets updated: the scaler never calls the optimizer’s step() method, because the scaler’s scale keeps shrinking until the floating-point value itself underflows, so the model cannot be updated. With AMP turned off, everything works fine.
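To illustrate what I’m seeing, here is a toy simulation of the scaler’s update rule — my own sketch using GradScaler’s documented defaults (backoff_factor=0.5, growth_factor=2.0, growth_interval=2000), not PyTorch’s actual implementation. If the scaled gradients are non-finite on every iteration, the step is always skipped and the scale halves each time, heading toward zero:

```python
def simulate_scaler(iters=200, init_scale=2.0 ** 16,
                    backoff_factor=0.5, growth_factor=2.0,
                    growth_interval=2000, grads_finite=False):
    """Toy model of GradScaler's dynamic: a step whose scaled gradients
    contain Inf/NaN is skipped and the scale is multiplied by
    backoff_factor; after growth_interval consecutive good steps the
    scale is multiplied by growth_factor."""
    scale, good_streak, steps_taken = init_scale, 0, 0
    for _ in range(iters):
        if grads_finite:
            steps_taken += 1          # optimizer.step() would run here
            good_streak += 1
            if good_streak == growth_interval:
                scale *= growth_factor
                good_streak = 0
        else:
            scale *= backoff_factor   # step skipped, scale backs off
            good_streak = 0
    return scale, steps_taken

print(simulate_scaler())                   # persistently bad grads: 0 steps, tiny scale
print(simulate_scaler(grads_finite=True))  # healthy grads: every step taken
```

With non-finite gradients on every batch, 200 iterations take the scale from 2**16 down to 2**-184, which is exactly the underflow-to-zero behavior I observe.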
Moreover, I tried fine-tuning other pretrained models from Hugging Face with AMP on, such as microsoft/deberta-v3-base, and they all work fine; the scale typically holds at 16384.
I’m fairly sure my AMP code is not the problem, since it only fails with this particular pretrained model. I also tried increasing the init_scale of GradScaler from the default 2**16 up to 2**40, in steps of 4 in the exponent (2**16, 2**20, …, 2**40), and every setting failed.
I wonder if there is a workaround for this. This particular pretrained model, mdeberta-v3-base, is quite strong, and we don’t want to give it up, nor train it with AMP off, which would cost a huge amount of time.
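For reference, my training step is essentially the standard AMP recipe. Below is a minimal sketch with a toy linear model standing in for mdeberta-v3-base (the model, optimizer, and data are placeholders; on a CPU-only machine the scaler is simply disabled and autocast falls back to bfloat16 so the sketch still runs):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 2).to(device)        # placeholder for mdeberta-v3-base
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

x = torch.randn(16, 8, device=device)
y = torch.randn(16, 2, device=device)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.step(opt)    # silently skipped if the scaled grads contain Inf/NaN
scaler.update()     # backs the scale off after a skipped step
print(scaler.get_scale())
```

The problem shows up in the last two lines: scaler.step(opt) is skipped on every single batch, and scaler.update() keeps shrinking the scale.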
This could mean that the forward pass or the loss calculation creates invalid values, and the GradScaler decreases the scaling factor on the assumption that it’s too large.
However, if the output is already invalid, decreasing the scaling factor of course won’t help anymore, so you should definitely check whether the predictions or the loss already contain Infs or NaNs.
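A quick way to check is something like this (check_finite is a hypothetical debugging helper, not a library function; call it on your model outputs and loss right after the forward pass):

```python
import torch

def check_finite(name, t):
    # Hypothetical debugging helper: report any NaN/Inf in a tensor.
    if torch.isfinite(t).all():
        return True
    print(f"{name}: {torch.isnan(t).sum().item()} NaNs, "
          f"{torch.isinf(t).sum().item()} Infs")
    return False

# e.g. inside the training loop:
#   check_finite("logits", outputs.logits)
#   check_finite("loss", loss)
bad = torch.tensor([1.0, float("inf"), float("nan")])
check_finite("example", bad)
```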
If so, check whether the model was pretrained with bfloat16, which has a larger range but less precision than float16. In that case, you might want to use autocast with dtype=torch.bfloat16.
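For example (a minimal sketch; the tensors are placeholders, and the point is only the range difference):

```python
import torch

# float16 overflows past ~65504, while bfloat16 keeps float32's
# 8-bit exponent range at reduced precision.
big = torch.tensor(1e30)
print(big.to(torch.float16))   # out of float16 range -> inf
print(big.to(torch.bfloat16))  # still finite in bfloat16

# Switching autocast to bfloat16 (device_type="cpu" here only so the
# sketch runs anywhere; on a GPU it would be "cuda"):
x, w = torch.randn(4, 8), torch.randn(8, 2)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = x @ w
```

Note that with bfloat16 autocast a GradScaler is generally unnecessary, since bfloat16 covers the same exponent range as float32.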
Increasing the init_scale wouldn’t help here and would create overflows even more easily.
Thanks for your informative advice. The loss is normal, so maybe there are NaNs or Infs in the gradients or parameters. Using torch.bfloat16 solves the problem, though it comes with some performance decrease. Also, not all GPUs support bfloat16; the V100, for example, does not.
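For anyone hitting the same issue, a capability check before choosing the autocast dtype looks roughly like this (a sketch; the fallback policy is my own choice, and CPU-only machines get bfloat16 since CPU autocast supports it):

```python
import torch

# Choose an autocast dtype based on what the hardware supports.
# bfloat16 needs Ampere or newer on NVIDIA GPUs (the V100 does not have it).
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    amp_dtype = torch.bfloat16   # safe range, no GradScaler needed
elif torch.cuda.is_available():
    amp_dtype = torch.float16    # pair with GradScaler
else:
    amp_dtype = torch.bfloat16   # CPU autocast fallback
print(amp_dtype)
```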