How does PyTorch implement weight_decay?

I’m sorry, I’m new to PyTorch, and I can’t find out how PyTorch implements L2 regularization (weight_decay).
I mean, there are several formulations of L2 regularization out there. Which one is implemented in PyTorch? It matters because it determines how large a value needs to be assigned.

Thank You

Looking at the code for the SGD optimizer in particular, it looks like it’s implemented by adding weight_decay * data to the gradients. Does this answer your question?
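
For illustration, here’s a minimal sketch of what that update amounts to (the helper name below is mine, not PyTorch’s, and the real optimizer also handles momentum, dampening, Nesterov, etc.):

import torch

def sgd_step_with_weight_decay(param, grad, lr=0.1, weight_decay=1e-4):
    # Fold the L2 penalty into the gradient, then take a plain SGD step.
    grad = grad + weight_decay * param
    return param - lr * grad

p = torch.randn(4)
g = torch.randn(4)
print(sgd_step_with_weight_decay(p, g))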

Why weight_decay * data?
Does the line:

if weight_decay != 0:
    d_p.add_(weight_decay, p.data)

mean weight_decay + data?

That line means, in other notation:
d_p = d_p + weight_decay * p.data.

Here’s a good article about why the L2 penalty is implemented by adding weight_decay * weight_i to the gradient: https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
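
To check that numerically, here’s a tiny sketch (not the optimizer code itself) using autograd, assuming the penalty is written with the conventional 1/2 factor, i.e. (weight_decay / 2) * ||w||^2:

import torch

weight_decay = 0.01
w = torch.randn(5, requires_grad=True)

# The gradient of (weight_decay / 2) * ||w||^2 with respect to w is
# weight_decay * w, which is exactly the term added to d_p above.
penalty = 0.5 * weight_decay * (w ** 2).sum()
penalty.backward()

print(torch.allclose(w.grad, weight_decay * w.detach()))  # True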


Thank you for your explanation and for the reference too.

I wonder where I can find the documentation showing that d_p.add_(weight_decay, p.data) means d_p = d_p + weight_decay * p.data?

torch.add(input, value=1, other, out=None)

Each element of the Tensor other is multiplied by the scalar value and added to each element of the Tensor input. The resulting Tensor is returned.

The shapes of input and other must be broadcastable.

out = input + (other * value)

If other is of type FloatTensor or DoubleTensor, value must be a real number, otherwise it should be an integer.
http://pytorch.org/docs/master/torch.html
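
As a quick check (note that recent PyTorch versions pass the scalar as the alpha= keyword instead of positionally):

import torch

weight_decay = 1e-4
p = torch.randn(3)    # stand-in for a parameter
d_p = torch.randn(3)  # stand-in for its gradient

expected = d_p + weight_decay * p
d_p.add_(p, alpha=weight_decay)  # same operation as the older d_p.add_(weight_decay, p.data)

print(torch.allclose(d_p, expected))  # True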


Thank you, it’s clear now :slight_smile: