Difference between Adam and AdamW implementation

AOZMH · March 29, 2021, 6:58am

I think this issue is addressed in this thread, which led to a (pending) update to the official documentation.
In short, weight decay in Adam is still implemented as the original L2 regularization (the decay term is added to the gradient), despite what the documentation says. So AdamW is probably still the better choice.
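
To make the difference concrete, here is a minimal scalar sketch of one update step under each scheme. It is plain Python, not the actual PyTorch source, and the function names `adam_l2_step` and `adamw_step` are made up for illustration. It assumes the usual bias-corrected Adam update, with the L2 variant folding the decay into the gradient and the decoupled variant shrinking the parameter directly.

```python
import math


def adam_l2_step(p, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One Adam step with L2-style weight decay (hypothetical helper)."""
    # L2 regularization: the decay term is added to the gradient,
    # so it also flows through the moment estimates below.
    g = g + wd * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


def adamw_step(p, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.1):
    """One Adam step with decoupled weight decay (hypothetical helper)."""
    # Decoupled decay: shrink the parameter directly; the gradient and
    # the moment estimates never see the decay term.
    p = p - lr * wd * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# Same starting point, zero loss gradient, first step (t=1).
p_l2, _, _ = adam_l2_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1)
p_w, _, _ = adamw_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1)
print(p_l2, p_w)
```

With a zero loss gradient, the L2 variant moves the parameter by roughly `lr` (to about 0.99), because the decay term gets rescaled by the adaptive denominator. The decoupled variant shrinks it by `lr * wd * p` (to 0.999). That rescaling is why the two are not equivalent for Adam, even though they coincide for plain SGD.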