As discussed in the paper Decoupled Weight Decay Regularization, PyTorch implements weight decay incorrectly for Adam (and other optimizers). It is easy to work around this by adding a few lines after loss.backward(), but why is PyTorch still using the incorrect implementation?
The code for correcting it (with weight_decay=0 set in the optimizer) is:
for group in optimizer.param_groups:
    for param in group['params']:
        # Decoupled weight decay: scale each parameter by (1 - lr * wd)
        # before the Adam step, so the decay bypasses the adaptive scaling.
        param.data = param.data.add(param.data, alpha=-wd * group['lr'])
optimizer.step()
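For context, here is a minimal, self-contained sketch of how this workaround could sit in a training loop; the tiny nn.Linear model, the random data, and the value of wd are purely illustrative and not from the paper or the post above.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
# weight_decay=0 so the optimizer itself applies no (coupled) decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)
wd = 1e-2  # decoupled weight-decay coefficient (illustrative value)

for step in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Decoupled weight decay, applied after loss.backward() and before
    # optimizer.step(), so the decay never passes through Adam's adaptive scaling.
    for group in optimizer.param_groups:
        for param in group['params']:
            param.data = param.data.add(param.data, alpha=-wd * group['lr'])

    optimizer.step()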
Is this question appropriate for the Site Feedback category, or should it go in Uncategorized?
Also, if this is not implemented in the library, I think there should be a flag to enable it. The reason is that Adam in its current form does not generalize well and is often outperformed by SGD with momentum.
It’s an interesting read. Maybe @smth, @albanD, or @ptrblck can answer whether this could be, or needs to be, integrated into PyTorch.
However, I think there is something wrong with the workaround you are proposing. The whole idea of the paper is to decouple the (parameter-specific) learning rate (alpha) from the decay parameter (lambda). Maybe group['lr'] should not be multiplied by wd. What do you think? I.e., just:
for group in optimizer.param_groups:
    for param in group['params']:
        # Same decay, but with a fixed coefficient not scaled by the learning rate.
        param.data = param.data.add(param.data, alpha=-wd)
optimizer.step()
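Just to make the magnitude of the difference concrete, a tiny arithmetic sketch (the lr and wd values are purely illustrative, not from the paper): with the lr factor, the per-step shrinkage tracks the learning-rate schedule and is much gentler; without it, the decay is fixed and far more aggressive.

# Per-step shrinkage factor applied to each parameter under the two variants,
# using illustrative values lr = 1e-3, wd = 1e-2.
lr, wd = 1e-3, 1e-2
factor_with_lr = 1 - wd * lr   # original workaround: param *= 0.99999 per step
factor_without_lr = 1 - wd     # proposed variant:    param *= 0.99 per step
print(factor_with_lr, factor_without_lr)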