AdamW handles weight decay correctly, and PyTorch already implements it as `torch.optim.AdamW`. Why, then, does `torch.optim.Adam` also have a `weight_decay` parameter?
Also, which optimizer is the correct one to use in PyTorch when I want weight decay? I am confused because most papers use Adam with weight decay instead of AdamW. Why do they do that?
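For context, here is my understanding of the difference, written as a minimal single-step sketch in plain Python (not PyTorch's actual implementation): Adam's `weight_decay` folds the decay term into the gradient before the adaptive moment estimates, while AdamW applies the decay directly to the weight, outside the adaptive scaling.

```python
import math

def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.01, t=1, decoupled=False):
    """One optimizer step on a scalar parameter p with gradient g."""
    if not decoupled:
        # Adam-style: L2 regularization is folded into the gradient,
        # so the decay gets rescaled by the adaptive denominator below.
        g = g + wd * p
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW-style: decay the weight directly, decoupled from the
        # gradient and the adaptive scaling.
        p = p - lr * wd * p
    return p, m, v

# Same starting weight and gradient, but the two rules give different results
p_adam,  _, _ = adam_step(1.0, 0.5, 0.0, 0.0, decoupled=False)
p_adamw, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, decoupled=True)
```

Is this a fair summary of why the two parameters behave differently, and of why the distinction matters in practice?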