Impact of learning rate in Mixed Precision training

tam · January 26, 2021, 4:18pm

Is there a minimum learning rate for training with AMP (mixed-precision training)? Say the lr scheduler drops the lr to a very small value (say 10^-20): will the weights still get updated in that epoch?
I understand that the loss and some other operations are computed in fp32 during mixed-precision training, so they may not be affected much. But what about the operations that run in fp16: will the weight update stop for them?

seungjun · January 27, 2021, 11:05am

torch.cuda.amp.autocast() uses the fp16 representation for its list of allowed operations, regardless of the magnitude of the values.
If your loss or gradients are too small, they will underflow to 0.
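
To make the underflow concrete, here is a small sketch using NumPy's float16 type as a stand-in for fp16 on the GPU. The gradient value 1e-8 and the scale factor 1024 are made-up numbers for illustration, not values taken from any real model.

```python
import numpy as np

# Limits of the fp16 format.
fp16_max = float(np.finfo(np.float16).max)  # largest finite fp16 value
smallest_positive = 2.0 ** -24              # smallest positive (subnormal) fp16 value

tiny_grad = 1e-8                            # hypothetical gradient value

# Below the smallest positive fp16 value, so it rounds to zero.
as_fp16 = np.float16(tiny_grad)

# Multiplying by a scale factor first moves it back into the fp16 range.
scaled_fp16 = np.float16(tiny_grad * 1024)

# Values above the fp16 maximum overflow to inf.
with np.errstate(over="ignore"):
    overflowed = np.float16(70000.0)

print("fp16 max:", fp16_max)
print("smallest positive fp16:", smallest_positive)
print("1e-8 in fp16:", as_fp16)
print("1e-8 * 1024 in fp16:", scaled_fp16)
print("70000 in fp16:", overflowed)
```

The scaled value is still in the subnormal range, so it keeps only a few bits of precision, but it is no longer zero.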

Gradient scaling (torch.cuda.amp.GradScaler) is there to keep gradients from underflowing to 0 where possible.
It works in three steps:

1. scale your loss by a large value (e.g. 1024),
2. compute the gradients,
3. unscale the gradients.
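
A minimal sketch of how these steps usually look in a training loop. The model, data, optimizer and learning rate here are placeholders I made up; init_scale=1024 is set only to match the example value above (the library default is larger). On a machine without a GPU, AMP is disabled and the loop runs in plain fp32.

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # AMP is only enabled when a GPU is present

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # placeholder lr
scaler = torch.cuda.amp.GradScaler(init_scale=1024.0, enabled=use_amp)

inputs = torch.randn(16, 8, device=device)                # dummy data
targets = torch.randn(16, 1, device=device)

weights_before = model.weight.detach().clone()

for _ in range(3):
    optimizer.zero_grad()
    # Eligible ops inside this context run in fp16 on the GPU.
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()  # steps 1 and 2: scale the loss, compute gradients
    scaler.step(optimizer)         # step 3: unscale, then step (skipped on inf/NaN)
    scaler.update()                # adjust the scale factor for the next iteration

weights_changed = not torch.equal(weights_before, model.weight.detach())
print("weights changed:", weights_changed)
print("parameter dtype:", model.weight.dtype)
```

Note that autocast does not change the dtype of the parameters themselves: they stay in fp32, and only the eligible operations run in fp16.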

If every value involved in the process above stays within the fp16 range, amp should work.
If any of them is larger than the fp16 maximum (65504) or smaller in magnitude than the smallest positive fp16 value (about 6e-8), it will overflow or underflow and amp may not work.
You will have to find a workable learning rate by running experiments and checking the gradient values.
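
One possible way to check the gradient values is to count how many nonzero gradient entries would turn into 0 if cast to fp16. The helper name fp16_underflow_fraction and the toy model below are my own inventions for illustration, not part of the amp API. The tiny loss multiplier 1e-9 is chosen only to force small gradients.

```python
import torch
from torch import nn


def fp16_underflow_fraction(model):
    """Fraction of nonzero gradient entries that become 0 when cast to fp16."""
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    flat = torch.cat(grads).float()
    nonzero = flat != 0
    lost = nonzero & (flat.half() == 0)
    return lost.sum().item() / max(nonzero.sum().item(), 1)


model = nn.Linear(4, 1)      # toy model
x = torch.ones(8, 4)         # dummy batch

# Unscaled: the loss is deliberately tiny, so the gradients are tiny too.
loss = model(x).sum() * 1e-9
loss.backward()
fraction_unscaled = fp16_underflow_fraction(model)

# Same loss multiplied by a scale factor of 65536 before backward.
model.zero_grad()
scaled_loss = model(x).sum() * (1e-9 * 65536)
scaled_loss.backward()
fraction_scaled = fp16_underflow_fraction(model)

print("underflow fraction without scaling:", fraction_unscaled)
print("underflow fraction with scaling:", fraction_scaled)
```

If the fraction stays high even with scaling, the gradients are probably too small for fp16 at that point in training.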

You can refer to the Wikipedia page on the half-precision floating-point format to check the precision limits of fp16.
I guess lr=10^-20 won’t work in fp16 in most cases.
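
As a rough check of that guess, the sketch below applies one plain SGD step with lr=1e-20 in NumPy. The weight 0.5 and gradient 1.0 are assumed values for illustration. It also shows the same step in fp32, since the parameters themselves are kept in fp32 under autocast.

```python
import numpy as np

lr = 1e-20
grad = 1.0     # assumed gradient value
weight = 0.5   # assumed weight value

# In fp16 the learning rate itself is already below the smallest positive value.
step_fp16 = np.float16(lr) * np.float16(grad)

# In fp32 the step is representable, but far too small to change the weight.
step_fp32 = np.float32(lr) * np.float32(grad)
new_weight_fp32 = np.float32(weight) - step_fp32

print("step in fp16:", step_fp16)
print("step in fp32:", step_fp32)
print("fp32 weight after update:", new_weight_fp32)
```

So with numbers like these the update is lost either way: the step is zero in fp16, and in fp32 it is nonzero but vanishes when subtracted from a weight of ordinary magnitude.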