Adam stability problems in Automatic Mixed Precision training

First, I have to say that you know a lot more about your situation and your training setup than I do, so take all this with a grain of salt.

The key advantage of LAMB/LARS, to me, is that they adapt the update at the tensor (per-layer) level, while Adam adapts it per individual scalar entry. LARS might be a more natural change from Adam than LAMB.
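Just to illustrate the per-tensor idea, here is a minimal sketch of a LARS-style trust ratio computed from whole-tensor norms (the `trust_coefficient` value and the plain SGD-like update are placeholders, not a full LARS/LAMB implementation):

```python
import torch

def lars_style_update(param, lr, trust_coefficient=0.001, eps=1e-8):
    # One scale factor per parameter tensor, computed from whole-tensor norms,
    # instead of Adam's per-element second-moment scaling.
    if param.grad is None:
        return
    w_norm = param.detach().norm()
    g_norm = param.grad.detach().norm()
    if w_norm > 0 and g_norm > 0:
        # Trust ratio ||w|| / ||g||: a single number for the whole tensor.
        local_lr = (trust_coefficient * w_norm / (g_norm + eps)).item()
    else:
        local_lr = 1.0
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr * local_lr)

# Illustrative usage:
w = torch.nn.Parameter(torch.randn(100, 100))
w.grad = torch.randn_like(w)
lars_style_update(w, lr=0.1)
```

Because the scaling is a single number per tensor, one extreme gradient entry has much less influence than it would on Adam's per-element second moment.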
If you are concerned about the effective batch size for the gradient step, you could accumulate gradients over a few batches before taking an optimizer step.
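In case it helps, a self-contained sketch of gradient accumulation under AMP (the toy model, random data, and `accumulation_steps = 4` are only placeholders for your setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

accumulation_steps = 4  # effective batch = 4x the per-iteration batch
loader = [(torch.randn(8, 10).cuda(), torch.randn(8, 1).cuda()) for _ in range(16)]

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    with torch.cuda.amp.autocast():
        # Divide the loss so the accumulated gradient matches one large batch.
        loss = criterion(model(inputs), targets) / accumulation_steps
    scaler.scale(loss).backward()      # gradients accumulate in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)         # one optimizer step per accumulated "large batch"
        scaler.update()
        optimizer.zero_grad()
```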

Best regards

Thomas