What is the difference between the implementation of Adam(weight_decay=…) and AdamW(weight_decay=…)?
They look the same to me, except that AdamW has a default value for the weight decay.
Please check the paper behind AdamW:
I consulted the official documentation of Adam & AdamW and noticed that the weight-decay formula shown for Adam also follows Decoupled Weight Decay Regularization (torch.optim — PyTorch 1.7.0 documentation), i.e. the same formulation as AdamW. Does that mean that, currently, Adam & AdamW are the same w.r.t. weight decay?
I have the same question!
I believe the issue is addressed in this thread, which led to a (pending) update to the official documentation.
A quick conclusion: the actual implementation of weight decay in Adam still follows the original L2 regularization (the decay term is added to the gradient before the adaptive scaling), despite what the documentation suggests, so AdamW is probably still the better choice.
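To make the distinction concrete, here is a minimal sketch of the two update rules, ignoring bias correction and autograd; the function names and simplifications are mine, not PyTorch's actual implementation:

```python
import torch

def adam_l2_step(p, grad, exp_avg, exp_avg_sq, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=1e-2):
    """Adam with coupled L2 regularization: the decay term is folded into the
    gradient, so it also passes through the adaptive second-moment scaling."""
    grad = grad + weight_decay * p                      # L2 term added to the gradient
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p -= lr * exp_avg / (exp_avg_sq.sqrt() + eps)

def adamw_step(p, grad, exp_avg, exp_avg_sq, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """AdamW with decoupled weight decay: the decay is applied directly to the
    weights and never touches the moment estimates."""
    p -= lr * weight_decay * p                          # decoupled decay on the weights
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p -= lr * exp_avg / (exp_avg_sq.sqrt() + eps)
```

The practical effect is that in the coupled version the decay is rescaled by the per-parameter adaptive term, while in AdamW every weight decays at the same relative rate, which is what the paper argues for.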
Thank you!
One more question… Have you used AdamW with the 1cycle learning-rate policy (OneCycleLR) in PyTorch? How do you use them together correctly?
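I haven't seen an official recipe in this thread, but a minimal sketch of how the two are typically combined would look like the following: AdamW as the optimizer and torch.optim.lr_scheduler.OneCycleLR stepped once per batch. The model, learning rates, and loop sizes here are placeholders, not a recommendation:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and training sizes, just to make the sketch self-contained.
model = nn.Linear(10, 2)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

epochs, steps_per_epoch = 5, 100
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       epochs=epochs, steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(32, 10)
        y = torch.randint(0, 2, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped after every batch, not every epoch
```

The main point is that OneCycleLR expects one scheduler.step() per optimizer step, so epochs * steps_per_epoch (or total_steps) must match the number of batches you actually train on.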