Difference between Adam and AdamW in PyTorch

There are a few discussions of the difference between Adam(weight_decay=0.01) and AdamW(), which point out that AdamW implements decoupled weight decay, whereas Adam applies plain L2 regularization by folding the decay term into the gradient.
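To make the distinction concrete, here is a minimal sketch of the two decay styles for a single scalar parameter. This is illustrative only and is not PyTorch's actual implementation; the function names and the scalar setup are made up for the example.

```python
# Minimal sketch contrasting the two weight-decay styles for one scalar
# parameter p with gradient g. Illustrative only, not PyTorch internals.

def l2_regularized_grad(p, g, weight_decay=0.01):
    """Adam(weight_decay=...): the decay term is added to the gradient,
    so it then flows through the adaptive moment estimates."""
    return g + weight_decay * p  # grad <- grad + lambda * theta

def decoupled_decay(p, lr=1e-3, weight_decay=0.01):
    """AdamW: the parameter is shrunk directly, independent of the
    adaptive gradient statistics."""
    return p * (1 - lr * weight_decay)  # theta <- theta * (1 - lr*lambda)
```

The practical consequence is that in Adam the decay term is rescaled by the adaptive learning rates, while in AdamW every parameter decays at the same relative rate regardless of its gradient history.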

However, I consulted the official documentation for Adam and AdamW and noticed that the weight-decay step in Adam's documented algorithm also appears to follow Decoupled Weight Decay Regularization, the same as AdamW. Does that mean Adam and AdamW are currently identical with respect to weight decay?

This documentation issue should be resolved by this PR.

Clear now, many thanks!