As discussed in the paper *Decoupled Weight Decay Regularization*, PyTorch implements weight decay incorrectly for Adam (and other optimizers). It is easy to work around the problem by adding a few lines after `loss.backward()`, but why does PyTorch still use the incorrect implementation?
The correcting code looks like this (with `weight_decay=0` passed to the optimizer, so the decay is not also applied inside the gradient update):

```python
# Decoupled weight decay: shrink the weights directly,
# rather than letting the optimizer add wd * param to the gradient.
for group in optimizer.param_groups:
    for param in group['params']:
        param.data.mul_(1 - wd * group['lr'])
optimizer.step()
```
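For context, here is a minimal sketch of where that snippet sits in a full training step. The `model`, `loss_fn`, and data names are placeholders for illustration, not from the original post:

```python
import torch

# Hypothetical setup; any model and loss would do.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
wd = 1e-2  # decoupled weight decay coefficient

# weight_decay=0 so Adam does not also add wd * param to the gradient.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Decoupled decay: applied after backward(), before step().
for group in optimizer.param_groups:
    for param in group['params']:
        param.data.mul_(1 - wd * group['lr'])

optimizer.step()
```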
Is Site Feedback the correct category for this question, or should it go in Uncategorized?