In the current PyTorch docs for torch.optim.Adam, the following is written:
"Implements Adam algorithm.
It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization."
This would lead me to believe that the current implementation of Adam is essentially equivalent to AdamW. However, the fact that torch.optim.AdamW exists as a separate optimizer suggests that this isn't the case. Also, after looking at the source code of torch.optim.Adam, I don't see any difference from a standard L2 penalty implementation: weight decay appears to be added to the gradient before the moment estimates are updated, rather than being applied directly to the parameters as in AdamW. So is the documentation incorrect, and are the "changes proposed in [Decoupled Weight Decay Regularization]" actually absent from torch.optim.Adam?
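For concreteness, here is a minimal sketch of how I understand the two update rules to differ (simplified single-tensor versions with my own function names and signatures, not PyTorch internals; parameter defaults are just illustrative):

```python
import torch

def adam_l2_step(p, grad, exp_avg, exp_avg_sq, step,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    # L2-penalty style (what I see in torch.optim.Adam's source):
    # weight decay is folded into the gradient, so it also flows
    # through the first and second moment estimates.
    grad = grad + weight_decay * p

    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_correction1 = 1 - betas[0] ** step
    bias_correction2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    p.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)


def adamw_decoupled_step(p, grad, exp_avg, exp_avg_sq, step,
                         lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    # Decoupled weight decay (what the AdamW paper proposes):
    # the parameter is shrunk directly, and the decay term never
    # enters the moment estimates or the adaptive denominator.
    p.mul_(1 - lr * weight_decay)

    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_correction1 = 1 - betas[0] ** step
    bias_correction2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    p.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)
```

If my reading is right, only the first variant matches what torch.optim.Adam does with its weight_decay argument, which is exactly the coupled L2 behaviour the AdamW paper argues against.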