While looking into the source code of optim.SGD, I found this:

    for p in group['params']:
        if p.grad is None:
            continue
        d_p = p.grad.data
        if weight_decay != 0:
            d_p.add_(weight_decay, p.data)

It looks to me like weight_decay is being used as an L1 penalty here (maybe I am wrong? There is only an add_ between weight_decay and the weight p). But the official doc describes it as weight decay (L2 penalty). Is this a bug?

Weight decay is an L2 penalty on the loss function.
When you take the derivative of the squared weights, you get just the value of the weights (times 2).
So we add that, scaled by weight_decay, directly to the gradients.
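
To make this concrete, here is a minimal sketch in plain Python (not the PyTorch source) that checks numerically that the gradient of an L2 penalty is proportional to the weights themselves. The helper names l2_penalty and numeric_grad are hypothetical, and the 0.5 scaling is a convention assumed here so that the gradient comes out as exactly weight_decay * w; without the 0.5, the factor of 2 appears.

```python
def l2_penalty(w, weight_decay):
    # 0.5 * weight_decay * ||w||^2 (the 0.5 is an assumed convention,
    # chosen so that the gradient is weight_decay * w with no factor of 2)
    return 0.5 * weight_decay * sum(x * x for x in w)


def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


weights = [0.5, -1.0, 2.0]   # made-up values for illustration
weight_decay = 0.1

penalty_grad = numeric_grad(lambda w: l2_penalty(w, weight_decay), weights)
expected = [weight_decay * x for x in weights]  # what add_ contributes to d_p

for g, e in zip(penalty_grad, expected):
    assert abs(g - e) < 1e-6
```

For comparison, the gradient of an L1 penalty weight_decay * sum(|w|) would be weight_decay * sign(w), which depends only on the sign of each weight, not on its value. Adding weight_decay * p.data therefore matches the L2 case.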

For non-adaptive optimizers without momentum, weight decay is the same (up to the factor of 2 mentioned by @albanD) as an additional L2-penalty added to the loss function. For optimizers such as Adam, the empirical evidence suggests that simple weight decay as implemented above outperforms a proper L2-penalty, but the interpretation isn't as clear. For more details, have a look at these reviews as well as the article.
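
The equivalence for plain SGD can be checked by hand. The sketch below, in plain Python with hypothetical helper names (sgd_step_l2, sgd_step_decay), compares one step that adds weight_decay * w to the gradient with one step that shrinks the weights directly. It assumes no momentum and the 0.5 * weight_decay * ||w||^2 convention for the penalty, and the numbers are made up.

```python
def sgd_step_l2(w, grad, lr, weight_decay):
    # L2 penalty folded into the gradient, as in the snippet quoted above:
    # d_p = grad + weight_decay * w, then w <- w - lr * d_p
    return [x - lr * (g + weight_decay * x) for x, g in zip(w, grad)]


def sgd_step_decay(w, grad, lr, weight_decay):
    # "Decay" view: shrink the weights by (1 - lr * weight_decay),
    # then take the ordinary gradient step.
    return [(1 - lr * weight_decay) * x - lr * g for x, g in zip(w, grad)]


w = [0.5, -1.0, 2.0]
grad = [0.1, 0.2, -0.3]   # made-up gradient of the unpenalized loss

a = sgd_step_l2(w, grad, lr=0.01, weight_decay=0.1)
b = sgd_step_decay(w, grad, lr=0.01, weight_decay=0.1)

# The two updates agree up to floating-point rounding.
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```

This identity relies on the update being a plain w - lr * d_p. Once the gradient is rescaled per parameter before it is applied, as in Adam, the two forms are generally no longer the same update.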