Why is the loss_scale getting smaller and smaller?

Hello, I am currently pretraining a HuBERT model. I have found that as the loss gets smaller, gradient overflows occur more and more frequently, causing the loss_scale to keep decreasing.

My understanding is that when the loss keeps getting smaller, the model is converging. So why would gradient overflow occur more frequently?

You could unscale the gradients manually and inspect them to see which ones are overflowing. Unscaling Infs or NaNs will of course keep these invalid values, but it should give you an idea of where in the model the gradients start to overflow.
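A minimal sketch of that inspection, using a tiny toy model in place of HuBERT and an assumed `loss_scale` value (your trainer's actual scale will differ) -- the per-parameter check is the same for any `nn.Module`:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for HuBERT; only the
# gradient-inspection logic below matters.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(4, 8)
loss_scale = 2.0 ** 16  # assumed value; read yours from the scaler/trainer

loss = model(x).pow(2).mean()
(loss * loss_scale).backward()  # scaled backward pass, as in mixed precision

# Manually unscale each gradient and record which parameters overflow.
overflowing = []
for name, p in model.named_parameters():
    if p.grad is None:
        continue
    grad = p.grad.float() / loss_scale  # manual unscale
    if not torch.isfinite(grad).all():  # True if any Inf/NaN survives
        overflowing.append(name)

print(overflowing)  # parameter names whose unscaled grads contain Inf/NaN
```

Running this right after `backward()` (and before `optimizer.step()`) on the steps where the scaler reports an overflow should point you to the layers where gradients first blow up.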
