A bug in PyTorch's optim.SGD (weight_decay)?

Mandy · September 9, 2019, 2:26am

When I was looking into the source code of optim.SGD, I found this:

for p in group['params']:
    if p.grad is None:
        continue
    d_p = p.grad.data
    if weight_decay != 0:
        # in place: d_p = d_p + weight_decay * p.data
        d_p.add_(weight_decay, p.data)

Here I think weight_decay is being used as an L1 penalty (maybe I am wrong? There is only an add_ of weight_decay times the weight p). But the official doc describes the parameter as "weight decay (L2 penalty)". Is this a bug?

albanD · September 9, 2019, 10:50pm

Hi,

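Weight decay is an L2 penalty on the loss function, i.e. a term proportional to the sum of the squared weights.
So when you take the derivative of that term, it becomes just the value of the weights (times 2, a constant that is absorbed into weight_decay).
So we add weight_decay * p directly to the gradients.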
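To make the derivative argument concrete, here is a minimal sketch with a single scalar weight and no PyTorch at all. The helper numerical_derivative and the two penalty functions are made up for illustration, and the L2 penalty is assumed to be written as (weight_decay / 2) * w**2 so that the factor of 2 cancels. It compares the term the snippet adds to the gradient against the numerical derivatives of an L2 and an L1 penalty:

```python
def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference; illustrative helper, not a PyTorch function.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

weight_decay = 0.1
w = -3.0

def l2_penalty(w):
    # Assumed form of the L2 penalty: (weight_decay / 2) * w^2
    return 0.5 * weight_decay * w ** 2

def l1_penalty(w):
    # L1 penalty for contrast: weight_decay * |w|
    return weight_decay * abs(w)

# What the SGD snippet adds to the gradient: weight_decay * p
added_by_sgd = weight_decay * w

l2_grad = numerical_derivative(l2_penalty, w)  # proportional to w
l1_grad = numerical_derivative(l1_penalty, w)  # depends only on the sign of w

print("added by SGD:", added_by_sgd)
print("d/dw of L2 penalty:", l2_grad)
print("d/dw of L1 penalty:", l1_grad)
```

The added term matches the L2 derivative. An L1 penalty would instead contribute weight_decay * sign(p), which has constant magnitude regardless of how large the weight is.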

dsuess · September 9, 2019, 11:38pm

For non-adaptive optimizers without momentum, weight decay is the same (up to the factor of 2 mentioned by @albanD) as an additional L2 penalty added to the loss function. For adaptive optimizers such as Adam, the two are no longer equivalent: adding weight_decay * p to the gradient is still the L2 penalty, whereas simple weight decay shrinks the weights directly, outside the adaptive update. The empirical evidence suggests that simple weight decay outperforms a proper L2 penalty there, but the interpretation isn’t as clear. For more details, have a look at these reviews as well as the article.
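A rough sketch of why the two coincide for plain SGD but not for Adam, again with one scalar weight and plain Python. The functions sgd_step and adam_first_step are hypothetical helpers, not PyTorch API. The Adam step is a simplified first step from zero moment estimates, and "decoupled" here means subtracting lr * wd * w from the weight separately from the gradient update, which is an assumption about how simple weight decay is applied:

```python
import math

def sgd_step(w, g, lr):
    # Plain SGD without momentum.
    return w - lr * g

def adam_first_step(w, g, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # First Adam step starting from zero moment estimates, with bias correction.
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

w, g, lr, wd = 2.0, 0.5, 0.1, 0.01

# L2 penalty: the decay term goes through the optimizer as part of the gradient.
sgd_l2 = sgd_step(w, g + wd * w, lr)
adam_l2 = adam_first_step(w, g + wd * w, lr)

# Decoupled weight decay: the weight is shrunk separately from the gradient step.
sgd_decoupled = sgd_step(w, g, lr) - lr * wd * w
adam_decoupled = adam_first_step(w, g, lr) - lr * wd * w

print("SGD  L2 vs decoupled:", sgd_l2, sgd_decoupled)
print("Adam L2 vs decoupled:", adam_l2, adam_decoupled)
```

For SGD the two results agree up to floating-point rounding. For Adam they differ, because the L2 term is rescaled by the adaptive denominator while the decoupled term is not.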

Mandy · September 10, 2019, 2:01am

@albanD Thank you for your explanation, that makes sense.

Mandy · September 10, 2019, 2:02am

Thank you @dsuess for the helpful links and your explanation.