What is the difference between the implementation of Adam(weight_decay=…) and AdamW(weight_decay=…)?
They look the same to me, except that AdamW has a default value for the weight decay.
Please check the paper behind AdamW:
I consulted the official documentation of Adam & AdamW and noticed that the weight-decay formula shown for Adam also follows Decoupled Weight Decay Regularization (torch.optim — PyTorch 1.7.0 documentation), i.e. the same formulation as AdamW. Does that mean that, currently, Adam & AdamW are the same w.r.t. weight decay?
I have the same question!
I believe the issue is addressed in this thread, which led to a (pending) update to the official documentation.
A quick conclusion: the actual implementation of weight decay in Adam still follows the original L2 regularization (the decay term is added to the gradient before the adaptive scaling), despite what the documentation suggests, so AdamW is probably still the better choice.
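To make the distinction concrete, here is a minimal sketch of the two update rules, ignoring bias correction and autograd; the function names and simplifications are mine, not PyTorch's actual implementation:

```python
import torch

def adam_l2_step(p, grad, exp_avg, exp_avg_sq, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=1e-2):
    """Adam with coupled L2 regularization: the decay term is folded into the
    gradient, so it also passes through the adaptive second-moment scaling."""
    grad = grad + weight_decay * p                      # L2 term added to the gradient
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p -= lr * exp_avg / (exp_avg_sq.sqrt() + eps)

def adamw_step(p, grad, exp_avg, exp_avg_sq, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """AdamW with decoupled weight decay: the decay is applied directly to the
    weights and never touches the moment estimates."""
    p -= lr * weight_decay * p                          # decoupled decay on the weights
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    p -= lr * exp_avg / (exp_avg_sq.sqrt() + eps)
```

The practical effect is that in the coupled version the decay is rescaled by the per-parameter adaptive term, while in AdamW every weight decays at the same relative rate, which is what the paper argues for.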
Thank you!
One more question… Have you used AdamW with the 1cycle learning-rate policy (OneCycleLR) in PyTorch? How do you use them together correctly?
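I haven't seen an official recipe in this thread, but a minimal sketch of how the two are typically combined would look like the following: AdamW as the optimizer and torch.optim.lr_scheduler.OneCycleLR stepped once per batch. The model, learning rates, and loop sizes here are placeholders, not a recommendation:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model and training sizes, just to make the sketch self-contained.
model = nn.Linear(10, 2)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

epochs, steps_per_epoch = 5, 100
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       epochs=epochs, steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(32, 10)
        y = torch.randint(0, 2, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR is stepped after every batch, not every epoch
```

The main point is that OneCycleLR expects one scheduler.step() per optimizer step, so epochs * steps_per_epoch (or total_steps) must match the number of batches you actually train on.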