When I was looking into the source code of optim.sgd(), I found this:
    for p in group['params']:
        if p.grad is None:
            continue
        d_p = p.grad.data
        if weight_decay != 0:
            d_p.add_(weight_decay, p.data)
Here I think weight_decay is being used as an L1 penalty (maybe I am wrong? There is only an add_ between weight_decay and the weight p). But the official doc describes it as "weight decay (L2 penalty)". Is this a bug?
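One way to check this would be to compare the term that add_ contributes, weight_decay * p, against the gradients of the two candidate penalties. Below is a minimal pure-Python sketch of that comparison. The function names and sample values are my own and not taken from the PyTorch source, and I am assuming the L2 penalty is written as (weight_decay / 2) * sum(p^2) and the L1 penalty as weight_decay * sum(|p|).

```python
# Sketch: compare weight_decay * p with numerical gradients of an
# L2 penalty (wd / 2) * sum(p^2) and an L1 penalty wd * sum(|p|).
# All names below are illustrative, not part of PyTorch.

def l2_penalty(p, wd):
    return 0.5 * wd * sum(x * x for x in p)

def l1_penalty(p, wd):
    return wd * sum(abs(x) for x in p)

def numerical_grad(f, p, wd, eps=1e-6):
    """Central finite-difference gradient of f(p, wd) with respect to p."""
    grad = []
    for i in range(len(p)):
        hi = list(p)
        lo = list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi, wd) - f(lo, wd)) / (2 * eps))
    return grad

p = [0.5, -2.0, 3.0]   # sample weights (arbitrary)
wd = 0.1               # sample weight_decay (arbitrary)

# The term that d_p.add_(weight_decay, p.data) adds to the gradient.
decay_term = [wd * x for x in p]

grad_l2 = numerical_grad(l2_penalty, p, wd)
grad_l1 = numerical_grad(l1_penalty, p, wd)

print("weight_decay * p:", decay_term)
print("grad of L2 term: ", grad_l2)
print("grad of L1 term: ", grad_l1)
```

If the first two printed lists agree and the third does not, then the add_ would be the gradient of an L2 penalty, since an L1 penalty would contribute weight_decay * sign(p) instead.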