Why is AdamW not the default weight decay implementation in PyTorch?

sad_robot · July 9, 2025, 6:35pm

I noticed that the default torch.optim.Adam() optimizer has a weight_decay=0 hyperparameter, yet torch.optim.AdamW is a separate implementation (why not replace the original?). Additionally, the torch.optim.RAdam() optimizer has both a weight_decay=0 and a decoupled_weight_decay=False hyperparameter.

Both of these design choices seem to indicate that AdamW’s decoupled weight decay implementation is only situationally useful and that the original “coupled weight decay” algorithm is a better default choice for most use cases, which directly contradicts the AdamW paper.

I’m wondering whether later research results gave the PyTorch designers this impression, or whether there have been issues reproducing the AdamW results.
Intuitively, I don’t understand why you would couple momentum with a regularization term.
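
To make the distinction concrete, here is a minimal sketch of a single Adam step on one scalar parameter, written in plain Python. It is an illustration of the two update rules, not PyTorch's actual code; the function name adam_step and the decoupled flag are hypothetical names chosen for this example.

```python
import math

def adam_step(p, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update on a scalar parameter (illustrative sketch only)."""
    if decoupled:
        # AdamW style: shrink the parameter directly. The decay term never
        # enters the moment estimates m and v.
        p = p * (1.0 - lr * weight_decay)
    else:
        # Coupled (L2 penalty) style: fold the decay into the gradient, so it
        # is accumulated in m and rescaled by sqrt(v) like any other gradient.
        g = g + weight_decay * p
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Zero loss gradient, so only the weight decay term moves the parameter.
p_coupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                            weight_decay=0.1, decoupled=False)
p_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1,
                              weight_decay=0.1, decoupled=True)
print(p_coupled, p_decoupled)
```

In the coupled case the decay term passes through Adam's normalization, so on this first step the parameter moves by roughly lr no matter how large weight_decay is. In the decoupled case the shrink is exactly lr * weight_decay * p.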

ptrblck · July 9, 2025, 6:58pm

Many defaults are kept for backward compatibility, and higher-level libraries sometimes change them to match current state-of-the-art usage.
I haven’t checked the details here, but the same could apply to potentially sub-optimal default arguments in these optimizers.