While looking at the implementation of AdamW in PyTorch, I noticed it differs from the original paper. In the paper, the weight-decay term is multiplied by the schedule multiplier, while in PyTorch it is multiplied by the full learning rate lr. I also checked the paper's source code, and I am fairly sure it differs from the PyTorch implementation. Why? Am I misunderstanding something?
original paper:
PyTorch implementation: AdamW — PyTorch 2.7 documentation
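To make the question concrete, here is a minimal sketch of just the weight-decay term in the two variants. This assumes `eta_t` is the schedule multiplier and `alpha` the base step size, so the full learning rate is `lr = eta_t * alpha`; the function names are mine, not from either codebase, and the Adam moment update is omitted:

```python
def paper_decay(theta, lam, eta_t):
    """Decay term as in the paper: scaled only by the schedule multiplier eta_t."""
    return theta - eta_t * lam * theta

def torch_style_decay(theta, lam, eta_t, alpha):
    """Decay term as in PyTorch's AdamW: scaled by the full lr = eta_t * alpha."""
    lr = eta_t * alpha
    return theta - lr * lam * theta

# With a typical small alpha, the torch-style decay is weaker by a factor of alpha:
theta, lam, eta_t, alpha = 1.0, 0.01, 0.5, 1e-3
print(paper_decay(theta, lam, eta_t))             # decays by eta_t * lam
print(torch_style_decay(theta, lam, eta_t, alpha))  # decays by eta_t * alpha * lam
```

So the two agree only up to a constant rescaling of lambda by alpha; since alpha is fixed during training, one can translate between the conventions by rescaling the weight-decay coefficient.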
