In the current PyTorch docs for torch.optim.Adam, the following is written:
"Implements Adam algorithm.
It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization."
This would lead me to believe that the current implementation of Adam is essentially equivalent to AdamW. However, the fact that torch.optim.AdamW exists as a separate optimizer suggests that this isn't the case. Also, after looking at the source code of torch.optim.Adam, I don't see any difference from a standard L2 penalty implementation: weight decay appears to be added to the gradient before the moment estimates are updated, rather than being applied directly to the parameters as in AdamW. So is the documentation incorrect, and are the "changes proposed in [Decoupled Weight Decay Regularization]" actually absent from torch.optim.Adam?
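For concreteness, here is a minimal sketch of how I understand the two update rules to differ (simplified single-tensor versions with my own function names and signatures, not PyTorch internals; parameter defaults are just illustrative):

```python
import torch

def adam_l2_step(p, grad, exp_avg, exp_avg_sq, step,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    # L2-penalty style (what I see in torch.optim.Adam's source):
    # weight decay is folded into the gradient, so it also flows
    # through the first and second moment estimates.
    grad = grad + weight_decay * p

    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_correction1 = 1 - betas[0] ** step
    bias_correction2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    p.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)


def adamw_decoupled_step(p, grad, exp_avg, exp_avg_sq, step,
                         lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    # Decoupled weight decay (what the AdamW paper proposes):
    # the parameter is shrunk directly, and the decay term never
    # enters the moment estimates or the adaptive denominator.
    p.mul_(1 - lr * weight_decay)

    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])

    bias_correction1 = 1 - betas[0] ** step
    bias_correction2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    p.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)
```

If my reading is right, only the first variant matches what torch.optim.Adam does with its weight_decay argument, which is exactly the coupled L2 behaviour the AdamW paper argues against.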