Is the implementation of AdamW in torch different from the original paper?

When I was looking at the implementation of AdamW in torch, I found that it differs from the original paper. In the paper, the weight decay term is multiplied by the lr_scheduler multiplier, while in torch it is multiplied by lr. I also checked the source code released with the paper, and I am fairly sure it differs from the torch implementation. Why? Am I misunderstanding something?

original paper:
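For reference, my reading of the decoupled update in Loshchilov & Hutter ("Decoupled Weight Decay Regularization"), with schedule multiplier $\eta_t$, base step size $\alpha$, and decay coefficient $\lambda$:

```latex
\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_{t-1} \right)
```

So the decay term is scaled only by $\eta_t$, not by $\eta_t \alpha$. By contrast, `torch.optim.AdamW` first rescales the parameter by $(1 - \text{lr} \cdot \lambda)$, i.e. it subtracts $\text{lr} \cdot \lambda \cdot \theta_{t-1}$, where `lr` is the scheduled learning rate.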

Pytorch implementation: AdamW — PyTorch 2.7 documentation
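To make the gap concrete, here is a minimal numeric sketch of just the weight-decay portion of one step under each convention. The specific values (`base_lr`, `weight_decay`, `eta_t`) are hypothetical, chosen only for illustration:

```python
# Hypothetical hyperparameters for illustration only
base_lr = 0.1        # alpha in the paper, the base lr in torch
weight_decay = 0.01  # lambda
eta_t = 0.5          # multiplier applied by an lr_scheduler at step t
w = 1.0              # a single parameter value

# Paper (decoupled): decay term is eta_t * lambda * w,
# i.e. scaled by the schedule multiplier only
decay_paper = eta_t * weight_decay * w

# torch.optim.AdamW: decay term is lr_t * lambda * w,
# where lr_t = eta_t * base_lr is the scheduled learning rate
decay_torch = (eta_t * base_lr) * weight_decay * w

print(decay_paper, decay_torch)
```

The torch decay step is smaller by exactly a factor of `base_lr`, which is why tuning `weight_decay` in torch implicitly depends on the learning rate.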


Hello. This is actually a well-known issue.

See: How to jointly tune learning rate and weight decay for AdamW - Fabian Schaipp

I would also love to see this bug fixed.