I’m sorry, I’m new to PyTorch, and I can’t find how PyTorch implements L2 regularization.
I mean, there are several formulations of L2 regularization out there; which one is implemented in PyTorch? The answer determines how large a value I need to assign to the hyperparameter.
Looking at the code for the SGD optimizer in particular, it looks like it’s implemented by adding
weight_decay * data to the gradients. Does this answer your question?
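If it helps, here is a minimal sketch (pure Python with made-up numbers, not the actual torch.optim.SGD code) of how the weight_decay term gets folded into the gradient before the update:

```python
# Sketch of SGD's weight-decay handling: the L2 term is added to the
# gradient, then a plain gradient step is taken.
def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    # corresponds to d_p = d_p + weight_decay * p.data
    grad = grad + weight_decay * w
    # plain SGD update: p.data = p.data - lr * d_p
    return w - lr * grad

# example: w=2.0, grad=0.5 -> grad becomes 0.52, new w = 2.0 - 0.052 = 1.948
w_new = sgd_step(2.0, 0.5)
```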
weight_decay * data?
Do you mean that the lines

if weight_decay != 0:
    d_p.add_(weight_decay, p.data)

add weight_decay * data to the gradient?
That line means, in other notation:
d_p = d_p + weight_decay * p.data.
Here’s a good article about why the L2 penalty is implemented by adding
weight_decay * weight_i to the gradient: https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
Thank you for your explanation and the reference.
I wonder where I can find documentation confirming that
d_p.add_(weight_decay, p.data) means
d_p = d_p + weight_decay * p.data?
From the torch.add documentation (this is the legacy signature; newer PyTorch versions spell it torch.add(input, other, alpha=1, out=None)):
torch.add(input, value=1, other, out=None)
Each element of the Tensor other is multiplied by the scalar value and added to each element of the Tensor input. The resulting Tensor is returned.
The shapes of input and other must be broadcastable.
If other is of type FloatTensor or DoubleTensor, value must be a real number, otherwise it should be an integer.
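So, assuming those legacy semantics, d_p.add_(weight_decay, p.data) scales p.data by the scalar weight_decay and adds the result into d_p in place. A pure-Python sketch of that behavior on plain lists (illustrative only, not torch’s actual implementation):

```python
# Emulates the legacy torch.add(input, value, other) semantics:
# each element of `other` is multiplied by the scalar `value`
# and added to the corresponding element of `input`.
def add_scaled(input_list, value, other_list):
    return [a + value * b for a, b in zip(input_list, other_list)]

d_p = [1.0, 2.0]        # pretend gradient
p_data = [10.0, 20.0]   # pretend parameter values
weight_decay = 0.1
d_p = add_scaled(d_p, weight_decay, p_data)  # -> [2.0, 4.0]
```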
Thank you, it’s clear now.