Difference between Adam and AdamW implementation

I believe this issue is addressed in this thread, which led to a (pending) update to the official documentation.
In short, weight decay in Adam is still implemented as the original L2 regularization (the decay term is added to the gradient), despite what the documentation says, so AdamW is probably still the better choice.
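
To make the difference concrete, here is a minimal scalar sketch of one update step in each style. It is not PyTorch's actual source; the function names `adam_l2_step` and `adamw_step` and the default hyperparameters are just assumptions for illustration.

```python
import math

def adam_l2_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    grad = grad + weight_decay * p              # decay passes through the moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """Decoupled weight decay: the parameter is shrunk directly, outside the Adam update."""
    p = p - lr * weight_decay * p               # decay is not rescaled by sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient, only the weight decay moves the parameter.
p_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
p_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
print(p_l2, p_w)
```

In the L2 version the decay term gets normalized by the second-moment estimate along with the gradient, so on this first step the parameter moves by roughly `lr` no matter how small `weight_decay` is. In the decoupled version it shrinks by `lr * weight_decay * p`, which is the behaviour people usually expect from weight decay.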
