I noticed that the default torch.optim.Adam() optimizer has a weight_decay=0 hyperparameter, yet torch.optim.AdamW is a separate implementation (why not replace the original?). Additionally, the torch.optim.RAdam() optimizer has both a weight_decay=0 and a decoupled_weight_decay=False hyperparameter.
Both of these design choices seem to indicate that AdamW’s decoupled weight decay implementation is only situationally useful and that the original “coupled weight decay” algorithm is a better default choice for most use cases, which appears to directly contradict the AdamW paper.
I’m wondering whether later research results gave the PyTorch designers this impression, or whether there have been issues reproducing the AdamW results.
Intuitively, I don’t understand why you would couple momentum with a regularization term.
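To make clear what I mean by "coupled" versus "decoupled", here is a minimal single-parameter sketch in plain Python. It is not PyTorch's actual code. The function name adam_step and its decoupled flag are my own, and the update rules follow my understanding of the Adam and AdamW algorithms. In the coupled form the decay term is added to the gradient, so it passes through the moment estimates. In the decoupled form the weight is shrunk directly and the moment estimates only see the loss gradient.

```python
import math


def adam_step(p, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update for a single scalar parameter (illustrative only)."""
    if decoupled:
        # AdamW-style: shrink the weight directly, outside the moment estimates.
        p = p - lr * weight_decay * p
    else:
        # Adam-style L2 penalty: fold the decay into the gradient, so it is
        # later smoothed by m and rescaled by sqrt(v).
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# First step from p = 1.0 with a zero loss gradient, so only decay acts.
coupled_big, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
coupled_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.001)
decoupled_big, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                                weight_decay=0.1, decoupled=True)
decoupled_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                                  weight_decay=0.001, decoupled=True)
```

In this toy case both coupled runs move the weight by roughly lr, because the adaptive normalization cancels the size of weight_decay. The decoupled runs shrink the weight by lr * weight_decay * p, so the coefficient actually controls the amount of decay. This is the behavior that makes the coupled default puzzling to me.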