Why does PyTorch not correct the regularization in its optimizers?

As discussed in the paper Decoupled Weight Decay Regularization, PyTorch implements weight decay incorrectly for Adam (and other optimizers). It is easy to work around this by adding some code after loss.backward(), but why does PyTorch still use the incorrect implementation?
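For context, the gap the paper points out can be sketched with a toy example (placeholder names, not the real Adam update): with coupled L2 regularization the decay term passes through the adaptive denominator, so weights with large gradients are barely decayed, whereas decoupled decay shrinks every weight by the same factor.

import torch

lr, wd, eps = 0.1, 0.01, 1e-8
w = torch.tensor([1.0, 1.0])
g = torch.tensor([0.01, 10.0])   # one small and one large gradient
v = g * g                        # stand-in for Adam's second-moment estimate

# coupled L2: the decay term is divided by the adaptive denominator too
w_coupled = w - lr * (g + wd * w) / (v.sqrt() + eps)

# decoupled: the decay acts on the weights directly, independent of gradient history
w_decoupled = w - lr * g / (v.sqrt() + eps) - lr * wd * w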

The code for correcting it is shown below, with weight_decay=0 passed to the optimizer (wd here is the decay rate applied manually):

for group in optimizer.param_groups:
    for param in group['params']:
        # decoupled decay: param <- param - lr * wd * param
        param.data = param.data.add(param.data, alpha=-wd * group['lr'])
optimizer.step()
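For completeness, here is a minimal sketch of where this sits in a training loop. It assumes model, criterion, and loader already exist, and that the optimizer is constructed with weight_decay=0:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0)
wd = 1e-2  # decay coefficient applied manually below

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # decoupled weight decay: shrink the weights directly, not via the gradients
    for group in optimizer.param_groups:
        for param in group['params']:
            param.data = param.data.add(param.data, alpha=-wd * group['lr'])
    optimizer.step()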

Is this question suited to the Site Feedback category, or should it go in Uncategorized?

Also, if it is not implemented in the library, I think there should be a flag to enable it, since Adam in this form does not generalize well and is often outperformed by SGD with momentum.

It’s an interesting read. Maybe @smth, @albanD, or @ptrblck can answer whether this could or needs to be integrated into PyTorch.

However, I think there is something wrong with the workaround you are proposing. The whole idea of the paper is to decouple the (parameter-specific) learning rate (alpha) from the decay parameter (lambda). Maybe group['lr'] should not be multiplied with wd. What do you think? I.e., just:

for group in optimizer.param_groups:
    for param in group['params']:
        # decay without the learning-rate factor: param <- param - wd * param
        param.data = param.data.add(param.data, alpha=-wd)

optimizer.step()
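For what it's worth, my reading of the paper's Algorithm 2 (worth double-checking against the paper) is that the decay term is decoupled from the base step size alpha but still shares the schedule multiplier eta_t with the gradient step. A sketch, where schedule_factor is a placeholder for eta_t:

schedule_factor = 1.0  # placeholder for the paper's eta_t, e.g. from a cosine schedule

for group in optimizer.param_groups:
    for param in group['params']:
        # paper-style decoupled decay: param <- param - eta_t * wd * param
        param.data = param.data.add(param.data, alpha=-schedule_factor * wd)

optimizer.step()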

I think the implementation of AdamW is being tracked here.
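Assuming the tracked AdamW lands with a constructor mirroring Adam's, usage would presumably look like this (model is an existing nn.Module; the hyperparameter values are arbitrary):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)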

Thanks, I missed that.