Implementation of AdamW

eunseokyang · December 9, 2020, 12:52pm

I have a question about the implementation of torch.optim.adamw.
According to the paper, I think the weight decay should be applied at the end of the algorithm. In the code, however, the weight decay is applied first and the rest of the update is executed afterwards.

Is this ordering correct?
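
To make the comparison concrete, here is a minimal sketch of the orderings in question for a single scalar parameter, in plain Python rather than the actual PyTorch source. The function names are made up for illustration. It assumes the paper's schedule multiplier is 1 and that the paper's decay coefficient corresponds to `lr * weight_decay`. Under those assumptions, decaying first gives the same result as an update whose decay term uses the old parameter value, because the gradient is already computed before the decay. Only a decay applied to the already-updated parameter would give a different number.

```python
import math


def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated moments and the bias-corrected Adam direction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    return m, v, m_hat / (math.sqrt(v_hat) + eps)


def step_decay_first(theta, grad, m, v, step, lr, wd):
    """Decay the parameter first, then apply the Adam update (the order described for the code)."""
    theta = theta * (1 - lr * wd)
    m, v, d = adam_direction(grad, m, v, step)
    return theta - lr * d


def step_decay_in_update(theta, grad, m, v, step, lr, wd):
    """Single update where the decay term uses the OLD parameter value.
    Assumption: schedule multiplier of 1, decay coefficient equal to lr * wd."""
    m, v, d = adam_direction(grad, m, v, step)
    return theta - lr * d - lr * wd * theta


def step_decay_on_updated(theta, grad, m, v, step, lr, wd):
    """Hypothetical variant: Adam update first, then decay the UPDATED parameter."""
    m, v, d = adam_direction(grad, m, v, step)
    theta = theta - lr * d
    return theta * (1 - lr * wd)


# Arbitrary illustrative values, not taken from any real training run.
args = dict(theta=1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=0.1, wd=0.01)
a = step_decay_first(**args)
b = step_decay_in_update(**args)
c = step_decay_on_updated(**args)
print("decay first == decay in update:", math.isclose(a, b, abs_tol=1e-12))
print("decay first == decay on updated:", math.isclose(a, c, abs_tol=1e-12))
```

The first and third variants differ only by a term proportional to `lr * lr * wd`, so the gap is small for typical learning rates but not exactly zero.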