I’m sorry, I’m new to PyTorch, and I can’t find how PyTorch implements L2 regularization.
I mean, there are several formulations of L2 regularization out there; which one is implemented in PyTorch? The answer determines how large a value I need to assign to the hyperparameter.
Looking at the code for the SGD optimizer in particular, it looks like it’s implemented by adding
weight_decay * data to the gradients. Does this answer your question?
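If it helps, here is a minimal sketch (pure Python with made-up numbers, not the actual torch.optim.SGD code) of how the weight_decay term gets folded into the gradient before the update:

```python
# Sketch of SGD's weight-decay handling: the L2 term is added to the
# gradient, then a plain gradient step is taken.
def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    # corresponds to d_p = d_p + weight_decay * p.data
    grad = grad + weight_decay * w
    # plain SGD update: p.data = p.data - lr * d_p
    return w - lr * grad

# example: w=2.0, grad=0.5 -> grad becomes 0.52, new w = 2.0 - 0.052 = 1.948
w_new = sgd_step(2.0, 0.5)
```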
weight_decay * data?
Do you mean that the lines

if weight_decay != 0:
    d_p.add_(weight_decay, p.data)

add weight_decay * data to the gradient?
That line means, in other notation:
d_p = d_p + weight_decay * p.data.
Here’s a good article about why the L2 penalty is implemented by adding
weight_decay * weight_i to the gradient: https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
Thank you for your explanation and the reference.
I wonder where I can find documentation confirming that
d_p.add_(weight_decay, p.data) means
d_p = d_p + weight_decay * p.data?
From the torch.add documentation (this is the legacy signature; newer PyTorch versions spell it torch.add(input, other, alpha=1, out=None)):
torch.add(input, value=1, other, out=None)
Each element of the Tensor other is multiplied by the scalar value and added to each element of the Tensor input. The resulting Tensor is returned.
The shapes of input and other must be broadcastable.
If other is of type FloatTensor or DoubleTensor, value must be a real number, otherwise it should be an integer.
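So, assuming those legacy semantics, d_p.add_(weight_decay, p.data) scales p.data by the scalar weight_decay and adds the result into d_p in place. A pure-Python sketch of that behavior on plain lists (illustrative only, not torch’s actual implementation):

```python
# Emulates the legacy torch.add(input, value, other) semantics:
# each element of `other` is multiplied by the scalar `value`
# and added to the corresponding element of `input`.
def add_scaled(input_list, value, other_list):
    return [a + value * b for a, b in zip(input_list, other_list)]

d_p = [1.0, 2.0]        # pretend gradient
p_data = [10.0, 20.0]   # pretend parameter values
weight_decay = 0.1
d_p = add_scaled(d_p, weight_decay, p_data)  # -> [2.0, 4.0]
```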
Thank you, it’s clear now.