How does PyTorch implement weight_decay?

I’m sorry, I’m new to PyTorch, and I can’t find out how PyTorch implements L2 regularization (weight_decay).
I mean, there are several formulations of L2 regularization out there. Which one is implemented in PyTorch? It matters because it determines how large a value needs to be assigned.

Thank You

Looking at the code for the SGD optimizer in particular, it looks like it’s implemented by adding weight_decay * data to the gradients. Does this answer your question?
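
For illustration, here’s a minimal sketch of what that update amounts to (the helper name below is mine, not PyTorch’s, and the real optimizer also handles momentum, dampening, Nesterov, etc.):

import torch

def sgd_step_with_weight_decay(param, grad, lr=0.1, weight_decay=1e-4):
    # Fold the L2 penalty into the gradient, then take a plain SGD step.
    grad = grad + weight_decay * param
    return param - lr * grad

p = torch.randn(4)
g = torch.randn(4)
print(sgd_step_with_weight_decay(p, g))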

Why weight_decay * data?
Does the line:

if weight_decay != 0:
    d_p.add_(weight_decay, p.data)

mean weight_decay + data?

That line means, in other notation:
d_p = d_p + weight_decay * p.data.

Here’s a good article about why the L2 penalty is implemented by adding weight_decay * weight_i to the gradient: https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
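
To check that numerically, here’s a tiny sketch (not the optimizer code itself) using autograd, assuming the penalty is written with the conventional 1/2 factor, i.e. (weight_decay / 2) * ||w||^2:

import torch

weight_decay = 0.01
w = torch.randn(5, requires_grad=True)

# The gradient of (weight_decay / 2) * ||w||^2 with respect to w is
# weight_decay * w, which is exactly the term added to d_p above.
penalty = 0.5 * weight_decay * (w ** 2).sum()
penalty.backward()

print(torch.allclose(w.grad, weight_decay * w.detach()))  # True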


Thank you for your explanation and for the reference too.

I wonder where I can find the documentation showing that d_p.add_(weight_decay, p.data) means d_p = d_p + weight_decay * p.data?

torch.add(input, value=1, other, out=None)

Each element of the Tensor other is multiplied by the scalar value and added to each element of the Tensor input. The resulting Tensor is returned.

The shapes of input and other must be broadcastable.

out = input + (other * value)

If other is of type FloatTensor or DoubleTensor, value must be a real number, otherwise it should be an integer.
http://pytorch.org/docs/master/torch.html
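
As a quick check (note that recent PyTorch versions pass the scalar as the alpha= keyword instead of positionally):

import torch

weight_decay = 1e-4
p = torch.randn(3)    # stand-in for a parameter
d_p = torch.randn(3)  # stand-in for its gradient

expected = d_p + weight_decay * p
d_p.add_(p, alpha=weight_decay)  # same operation as the older d_p.add_(weight_decay, p.data)

print(torch.allclose(d_p, expected))  # True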


Thank you, it’s clear now :slight_smile: