Difference between Adam and AdamW implementation

What is the difference between the implementation of Adam(weight_decay=…) and AdamW(weight_decay=…)?
They look the same to me, except that AdamW has a default value for the weight decay.

Please check the paper behind AdamW, Decoupled Weight Decay Regularization:
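
In short, and paraphrasing the paper's update rules in its own notation (a sketch, not a quotation): Adam with L2 regularization folds the decay into the gradient,

$$g_t = \nabla f(\theta_{t-1}) + \lambda \theta_{t-1},$$

so the decay term also gets divided by $\sqrt{\hat{v}_t}$ like the rest of the gradient, whereas AdamW decouples it and applies it directly to the weights:

$$\theta_t = \theta_{t-1} - \eta \left( \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right).$$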


I consulted the official documentation of Adam & AdamW and noticed that the weight-decay description for Adam also refers to Decoupled Weight Decay Regularization (torch.optim — PyTorch 1.7.0 documentation). Does that mean that Adam & AdamW are currently the same w.r.t. weight decay?


I have the same question!

I guess the issue is addressed in this thread, which led to a (pending) update of the official documentation.
The quick conclusion is that the actual implementation of weight decay in Adam still follows the original L2 regularization, despite what the documentation says, so AdamW is probably still the better choice. A simplified sketch of the difference is shown below.
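
For reference, here is a minimal, illustrative sketch of a single Adam-style parameter update (not the torch.optim source; the function name, arguments, and defaults are placeholders) showing where the two flavours of weight decay enter:

```python
import torch

def adam_like_step(p, grad, exp_avg, exp_avg_sq, step, lr, wd,
                   beta1=0.9, beta2=0.999, eps=1e-8, decoupled=False):
    """One simplified Adam/AdamW-style update, for illustration only.

    decoupled=False: classic L2 regularization (what Adam(weight_decay=...) does),
    the decay is folded into the gradient and rescaled by the adaptive denominator.
    decoupled=True:  decoupled weight decay (AdamW), the weights are shrunk directly.
    """
    if decoupled:
        # AdamW: shrink the weights directly, outside the adaptive update
        p.mul_(1 - lr * wd)
    else:
        # Adam: add the decay to the gradient, so it passes through the
        # moment estimates and the sqrt(v) denominator below
        grad = grad + wd * p

    # Standard Adam moment updates and bias correction
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    p.addcdiv_(exp_avg / bias_c1, denom, value=-lr)
    return p
```

The practical consequence is that with the L2 variant the effective decay of a weight depends on its gradient history, while with the decoupled variant every weight decays at the same rate lr * wd.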


Thank you!
One more question… Have you used AdamW with the 1cycle learning-rate policy (OneCycleLR) in PyTorch? How do you use them together correctly?

Haven’t tried it yet… Maybe you can find some tutorials covering your use case?
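
For what it's worth, a minimal sketch of wiring the two together (the model, data loader, epochs, and the lr / max_lr / weight_decay values below are placeholders, not recommendations); the main point is that OneCycleLR is stepped once per batch, not once per epoch:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and data; replace with your own.
model = torch.nn.Linear(10, 2)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(100)]
criterion = torch.nn.CrossEntropyLoss()

epochs = 3
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       steps_per_epoch=len(loader), epochs=epochs)

for _ in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # step the 1cycle schedule after every batch
```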
