When I was looking into the source code of optim.SGD, I found this:

    for p in group['params']:
        if p.grad is None:
            continue
        d_p = p.grad.data
        if weight_decay != 0:
            d_p.add_(weight_decay, p.data)
It looks to me like weight_decay is used as an L1 penalty (maybe I am wrong? There is only an add_ between weight_decay and the weight p). But the official doc describes it as weight decay (L2 penalty). Is this a bug?
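Weight decay is an L2 penalty on the loss function, i.e. a term proportional to the sum of the squared weights.
So when you take the derivative of that term with respect to a weight, it becomes just the value of the weight (times 2, which can be absorbed into the weight_decay coefficient).
So we add weight_decay times the weights directly to the gradients.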
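To make the derivative argument concrete, here is a minimal pure-Python sketch for a single scalar weight. The data loss `(w - 3)^2`, the helper names, and the numbers are all made up for illustration and are not taken from the PyTorch source. It checks numerically that adding `weight_decay * w` to the gradient matches an explicit penalty of `(weight_decay / 2) * w^2`, and that an L1 penalty `weight_decay * |w|` would give a different gradient.

```python
# Illustrative only: one scalar weight and a made-up data loss.

def data_loss(w):
    return (w - 3.0) ** 2

def data_loss_grad(w):
    return 2.0 * (w - 3.0)

def numeric_grad(f, w, eps=1e-6):
    # Central finite difference, used as an independent check.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w, weight_decay = 1.5, 0.1

# What the SGD snippet does: add weight_decay * w to the gradient.
d_p = data_loss_grad(w) + weight_decay * w

# Gradient of the loss with an explicit L2 penalty (weight_decay / 2) * w^2.
l2_grad = numeric_grad(lambda x: data_loss(x) + 0.5 * weight_decay * x ** 2, w)

# Gradient of the loss with an explicit L1 penalty weight_decay * |w|.
l1_grad = numeric_grad(lambda x: data_loss(x) + weight_decay * abs(x), w)

print(d_p, l2_grad, l1_grad)
```

An L1 penalty would contribute `weight_decay * sign(w)` to the gradient rather than `weight_decay * w`, which is why the `add_` of the weight itself corresponds to L2.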
For non-adaptive optimizers without momentum, weight decay is the same (up to the factor of 2 mentioned by @albanD) as an additional L2-penalty added to the loss function. For optimizers such as Adam, the empirical evidence suggests that simple weight decay as implemented above outperforms a proper L2-penalty, but the interpretation isn’t as clear. For more details, have a look at these reviews as well as the article.
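As a rough illustration of why the two notions stop coinciding for adaptive optimizers, here is a hand-written, scalar Adam-style step in pure Python. This is a sketch under assumptions, not PyTorch's implementation: the function `adam_step`, its arguments `l2` and `decoupled`, and the toy loss are hypothetical, and "decoupled" here means subtracting `lr * decay * w` from the weight outside the adaptive rescaling.

```python
import math

def data_loss_grad(w):
    # Gradient of a made-up data loss (w - 3)^2.
    return 2.0 * (w - 3.0)

def adam_step(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled=0.0):
    # First Adam-style step (t = 1), starting from zero moment estimates.
    g = data_loss_grad(w) + l2 * w            # L2 penalty enters the gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)                   # bias correction for t = 1
    v_hat = v / (1 - beta2)
    step = lr * m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay shrinks the weight directly, without adaptive scaling.
    return w - step - lr * decoupled * w

w0 = 1.5
w_l2 = adam_step(w0, l2=0.1)
w_decoupled = adam_step(w0, decoupled=0.1)
print(w_l2, w_decoupled)
```

In this toy step the L2 term is almost entirely normalized away by the division by `sqrt(v_hat)`, while the decoupled decay still shrinks the weight, so the two variants end up at different values. For plain SGD without momentum there is no such rescaling, which is why the two coincide there.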
@albanD Thank you for your explanation, that makes sense.
Thank you @dsuess for that helpful link and your explanation