Simple L2 regularization?

Kalamaya · January 22, 2017, 10:47pm

Hi, does simple L2 / L1 regularization exist in PyTorch? I did not see anything like that in the losses.

I guess one way to do it is to compute data_loss + reg_loss (perhaps using nn.MSELoss for the L2 term), but is there an explicit way to use it without doing it this way?

Thanks

fmassa · January 22, 2017, 11:04pm

Hi,

The L2 regularization on the parameters of the model is already included in most optimizers, including optim.SGD, and can be controlled with the weight_decay parameter, as can be seen in the SGD documentation.

L1 regularization is not included by default in the optimizers, but could be added by including an extra nn.L1Loss term computed on the weights of the model.
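To make the weight_decay remark concrete, here is a minimal sketch that compares the two formulations for plain SGD (no momentum). The toy model, data, and the values of wd and lr are assumptions chosen only for illustration. For plain SGD, weight_decay=wd adds wd * p to each gradient, which matches an explicit penalty of (wd / 2) * sum(p ** 2) in the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
wd, lr = 0.1, 0.5  # illustrative values, not recommendations

# Two identical toy models
model_a = nn.Linear(3, 1)
model_b = nn.Linear(3, 1)
model_b.load_state_dict(model_a.state_dict())

x = torch.randn(8, 3)
y = torch.randn(8, 1)
mse = nn.MSELoss()

# (a) L2 penalty applied by the optimizer through weight_decay
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
opt_a.zero_grad()
mse(model_a(x), y).backward()
opt_a.step()

# (b) the same penalty written explicitly in the loss
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
opt_b.zero_grad()
l2 = sum(p.pow(2).sum() for p in model_b.parameters())
(mse(model_b(x), y) + 0.5 * wd * l2).backward()
opt_b.step()

# After one step, both models should hold (numerically) the same parameters
same = all(torch.allclose(pa, pb, atol=1e-6)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```

Note that this equivalence is specific to plain SGD; optimizers with adaptive step sizes treat the added gradient term differently.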

l1_crit = nn.L1Loss(size_average=False)
reg_loss = 0
for param in model.parameters():
    reg_loss += l1_crit(param)

factor = 0.0005
loss += factor * reg_loss

Note that this might not be the best way of enforcing sparsity on the model though.
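A minimal sketch of the same idea in current PyTorch, assuming a toy nn.Linear model and random data that are not part of the original post. It computes the penalty with abs().sum() on each parameter instead of nn.L1Loss, which expects an (input, target) pair.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # toy model, for illustration only
x, y = torch.randn(5, 4), torch.randn(5, 2)  # toy data

data_loss = nn.MSELoss()(model(x), y)

# L1 penalty: sum of absolute values over all parameters (weights and biases)
reg_loss = sum(p.abs().sum() for p in model.parameters())

factor = 0.0005
loss = data_loss + factor * reg_loss
loss.backward()  # gradients now include the L1 term
```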

Kalamaya · January 22, 2017, 11:15pm

Thanks @fmassa - although I must say it’s odd that a regularization loss is included in the optimizer here. 0_0

apaszke · January 22, 2017, 11:59pm

Yeah, that’s been added there as an optimization, as L2 regularization is often used.

Kalamaya · January 23, 2017, 12:38am

Got it thanks!

ecolss · March 11, 2017, 1:36am

What would you recommend as a better way to enforce sparsity instead of L1?

fmassa · March 11, 2017, 11:05am

This comment might be helpful https://github.com/torch/optim/pull/41#issuecomment-73935805

ecolss · March 11, 2017, 1:12pm

Thanks for the note

ncullen93 · April 12, 2017, 9:23pm

@fmassa does this still work? It seems that nn.L1Loss requires a target - giving the error TypeError: forward() missing 1 required positional argument: 'target'

example:

xx = nn.Parameter(torch.from_numpy(np.ones((3,3))))
l1_crit = nn.L1Loss()
l1_crit(xx)

fmassa · April 13, 2017, 10:14am

I forgot to add the target, which in some cases would be a zero-tensor.
So something like

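import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable
xx = nn.Parameter(torch.from_numpy(np.ones((3,3))))
target = Variable(torch.from_numpy(np.zeros((3,3))))  # zero target, so the loss is mean |xx|
l1_crit = nn.L1Loss()
l1_crit(xx, target)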

Ja-Keoung_Koo · May 6, 2017, 4:43am

I have two questions about L1 regularization:

1. How do we backpropagate through L1 regularization? I ask because the term is not differentiable at zero.
2. Where can I see the implementation of L1 regularization? At the following link, there is only pass.

github.com

pytorch/pytorch/blob/ecd51f8510bb1c593b0613f3dc7caf31dc29e16b/torch/nn/modules/loss.py#L39


def __init__(self, weight=None, size_average=True):
    super(_WeightedLoss, self).__init__(size_average)
    self.register_buffer('weight', weight)

def forward(self, input, target):
    _assert_no_grad(target)
    backend_fn = getattr(self._backend, type(self).__name__)
    return backend_fn(self.size_average, weight=self.weight)(input, target)


class L1Loss(_Loss):
    r"""Creates a criterion that measures the mean absolute value of the
    element-wise difference between input `x` and target `y`:

    :math:`{loss}(x, y)  = 1/n \sum |x_i - y_i|`

    `x` and `y` arbitrary shapes with a total of `n` elements each.

    The sum operation still operates over all the elements, and divides by `n`.

    The division by `n` can be avoided if one sets the constructor argument `size_average=False`

smth · May 7, 2017, 2:31pm

you can see the implementation of L1Loss here: https://github.com/pytorch/pytorch/blob/ecd51f8510bb1c593b0613f3dc7caf31dc29e16b/torch/lib/THNN/generic/L1Cost.c

gwg · September 14, 2017, 6:31pm

@fmassa, so because of PyTorch’s autograd functionality, we do not need to worry about L1 regularization during the backward pass, i.e. applying the derivative of the L1 regularization term to the gradient of the output? That will be handled by the autograd variables?

fmassa · September 17, 2017, 12:06pm

If you add the L1 regularization to the loss as I explained in a previous post, the gradients will be handled by autograd automatically.
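As a small check of what autograd does with the L1 term, the sketch below (with an arbitrary example tensor) differentiates the sum of absolute values. PyTorch uses sign(w) as the gradient of |w|, which gives 0 at the non-differentiable point w = 0.

```python
import torch

w = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)

# L1 term: sum of absolute values
w.abs().sum().backward()

print(w.grad)  # tensor([-1.,  0.,  1.])
```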

yibo · December 26, 2017, 7:15pm

I like to use l1_loss=F.l1_loss(xx, target=torch.zeros_like(xx), size_average=False)
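In recent PyTorch releases the size_average argument is deprecated in favor of reduction. A minimal sketch of the same call, assuming a small example tensor xx, would be:

```python
import torch
import torch.nn.functional as F

xx = torch.ones(3, 3, requires_grad=True)  # example tensor, for illustration

# reduction="sum" plays the role of size_average=False
l1_loss = F.l1_loss(xx, torch.zeros_like(xx), reduction="sum")

# With a zero target this is the same as the sum of absolute values
same = torch.isclose(l1_loss, xx.abs().sum())
```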

Brando_Miranda · February 3, 2018, 10:59pm

If I want to use a custom regularizer R, is the following code good:

batch_loss = loss(input=y_pred, target=batch_ys)
batch_loss += lam * R  # `lambda` is a reserved word in Python, so the factor is named lam

?
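A minimal sketch of this pattern, under the assumption that R is computed from the model parameters with ordinary tensor operations so that autograd can track it. The regularizer below (squared L2 norm of the weight matrices only) and all names other than batch_loss, y_pred, and batch_ys are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

def regularizer(model):
    # Hypothetical R: squared L2 norm of the weight tensors, skipping biases
    return sum(p.pow(2).sum()
               for name, p in model.named_parameters() if "weight" in name)

model = nn.Linear(4, 1)                                   # toy model
loss = nn.MSELoss()
batch_xs, batch_ys = torch.randn(6, 4), torch.randn(6, 1)  # toy batch
lam = 1e-3                                                # regularization strength

y_pred = model(batch_xs)
batch_loss = loss(input=y_pred, target=batch_ys)
batch_loss = batch_loss + lam * regularizer(model)
batch_loss.backward()  # gradients include the contribution of R
```

If R were computed outside of autograd (for example from NumPy copies of the weights), it would change the reported loss value but contribute nothing to the gradients.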

hughperkins · June 4, 2018, 12:59am

@fmassa You say “this might not be the best way of enforcing sparsity on the model”, and “This comment might be helpful https://github.com/torch/optim/pull/41#issuecomment-73935805”. That comment explains that any weights changing sign should be explicitly set to 0. Does this mean that you feel that L1 with explicit zeroing of weights crossing zero is an appropriate way of encouraging sparsity? Or do you mean that there are other approaches that can work well?
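One possible reading of “explicitly set to 0 any weights changing sign” is sketched below. This is an assumption about what such a scheme could look like, not the code from the linked comment: the helper zero_sign_crossings, the toy model, and the hyperparameters are all hypothetical.

```python
import torch
import torch.nn as nn

def zero_sign_crossings(params, previous):
    # Hypothetical helper: set to 0 every entry whose sign flipped
    # between `previous` (before the step) and `params` (after the step).
    with torch.no_grad():
        for p, old in zip(params, previous):
            p[(p * old) < 0] = 0.0

model = nn.Linear(4, 1)                          # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 4), torch.randn(16, 1)    # toy data
factor = 0.01                                    # L1 strength, illustrative

for _ in range(20):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss = loss + factor * sum(p.abs().sum() for p in model.parameters())
    loss.backward()

    before = [p.detach().clone() for p in model.parameters()]
    opt.step()
    zero_sign_crossings(model.parameters(), before)
```

The intent of the clipping step is that a weight pushed past zero by the L1 term lands exactly on zero instead of oscillating around it with a small nonzero value.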

Neta_Zmora · June 8, 2018, 5:48pm

Hi @hughperkins,

If you are interested in inducing sparsity, you might want to check out this project from Intel AI Labs. The documentation tries to shed some light on recent research related to sparsity-inducing methods.

The project includes a stand-alone Jupyter notebook that attempts to show how L1 regularization can be used to induce sparsity (by “stand-alone” I mean that the notebook does not import any code from Distiller, so you can just try it out).

Cheers,
Neta

shirui-japina · September 22, 2019, 7:00am

Sorry for asking here.
I have heard that L2 regularization should be applied only to the weight parameters, not to the bias parameters. (Is that right?)
But the L2 regularization included in most optimizers in PyTorch applies to all of the parameters in the model (weights and biases).
[Image: least_squares_l2]
I mean that the parameters in the red box should be the weight parameters only (if what I heard is right).
How can I deal with this?

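import torch.optim as optim

# `model` is assumed to be an existing nn.Module
# Split the parameters into weights and biases by name
weight_p, bias_p = [], []
for name, p in model.named_parameters():
  if 'bias' in name:
    bias_p += [p]
  else:
    weight_p += [p]

# Per-group options: apply weight decay to the weights only
optimizer = optim.SGD(
  [
    {'params': weight_p, 'weight_decay': 1e-5},
    {'params': bias_p, 'weight_decay': 0}
  ],
  lr=1e-2, momentum=0.9
)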

Would the code above deal with this problem correctly?

Deepak · April 25, 2020, 4:19pm

We use regularization to handle the high-variance problem (overfitting). It is fine for the regularization to include all the learnable parameters (both weights and biases). But since the biases make up only a small fraction of the total number of parameters, they are usually not included in the regularization, and excluding them hardly affects the results.