How does one implement weight regularization (L1 or L2) manually, without using torch.optim's weight_decay?

I wanted to do it manually so I implemented it as follows:

reg_lambda=1.0
l2_reg=0
for W in mdl.parameters():
    l2_reg += W.norm(2)  # accumulate the L2 norm of every parameter
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + reg_lambda*l2_reg
## BACKWARD PASS
batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients

Is this correct? The key part I care about is that the SGD update works correctly, i.e. that the following update:

## SGD update
for W in mdl.parameters():
    delta = eta*W.grad.data
    W.data.copy_(W.data - delta) # W <- W - eta*g

picks up the 2W term from the L2 penalty.

Here is a related question: Simple L2 regularization?

It's almost correct.

l2_reg here is a Python scalar, so operations done on it are not recorded for the autograd backward().
Instead, you should make l2_reg an autograd Variable:

l2_reg = None
for W in mdl.parameters():
    if l2_reg is None:
        l2_reg = W.norm(2)
    else:
        l2_reg = l2_reg + W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
batch_loss.backward()
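
As a quick sanity check on the 2W question from the original post, here is a standalone sketch with one hypothetical parameter, on a current PyTorch version where tensors carry autograd directly: as written, W.norm(2) contributes reg_lambda * W / ||W|| to the gradient; the 2 * reg_lambda * W term only appears once the norm is squared, which comes up again further down the thread.

import torch

reg_lambda = 1.0
W = torch.randn(5, requires_grad=True)    # hypothetical single parameter

penalty = reg_lambda * W.norm(2)          # the penalty as written above: the norm itself
penalty.backward()
print(torch.allclose(W.grad, reg_lambda * W.detach() / W.detach().norm(2)))  # True

W.grad = None
penalty_sq = reg_lambda * W.norm(2) ** 2  # squared norm: now the gradient is 2 * reg_lambda * W
penalty_sq.backward()
print(torch.allclose(W.grad, 2 * reg_lambda * W.detach()))  # True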

Why not:

    l2_reg = Variable(torch.FloatTensor(1), requires_grad=True)
    for W in mdl.parameters():
        l2_reg = l2_reg + W.norm(2)

(not sure if l2_reg should be 1x1 or 1).
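
One caveat with torch.FloatTensor(1): it allocates uninitialized memory, so the accumulator may start from garbage. Zero-initializing it avoids that; a small sketch, assuming a hypothetical mdl standing in for the model above:

import torch
from torch.autograd import Variable       # old-style API, matching the thread

mdl = torch.nn.Linear(4, 2)               # hypothetical stand-in for the model

l2_reg = Variable(torch.zeros(1), requires_grad=True)   # zeros, not uninitialized memory
for W in mdl.parameters():
    l2_reg = l2_reg + W.norm(2)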

Yes, you can do this too.

I get the following error while trying this:

File “…/lib/python3.6/site-packages/torch/autograd/variable.py”, line 146, in backward
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File “…/lib/python3.6/site-packages/torch/autograd/_functions/reduce.py”, line 200, in backward
return input.mul(grad_output[0] / self.norm)
ZeroDivisionError: float division by zero

Any suggestions? (I guess it's somehow getting self.norm as 0.)

As a workaround, I just ensure the L2 norm of the weights is not 0 after initialization (which I think should be handled in the code).

Since the parameters are Variables, won’t l2_reg be automatically converted to a Variable at the end? I’m using l2_reg=0 and it seems to work.
Also, I'm not sure the OP's formula for L2 regularization is correct: you need the sum of every parameter element squared.

W.norm(2) should be W.norm(2)**2, no?

I'm missing a square: W.norm(2) should be W.norm(2)**2, no?

Yeah; but it’s probably more efficient to just do torch.pow(W, 2).sum(). Also you may want to multiply by 0.5 as a standard convention.
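
Putting those two suggestions together, the accumulation could look like this sketch (current PyTorch, with a hypothetical mdl and batch standing in for the variables above):

import torch

mdl = torch.nn.Linear(4, 1)                       # hypothetical stand-ins for the thread's variables
batch_xs, batch_ys = torch.randn(8, 4), torch.randn(8, 1)
N_train = batch_ys.numel()
reg_lambda = 1.0

y_pred = mdl(batch_xs)

l2_reg = None
for W in mdl.parameters():
    sq = 0.5 * torch.pow(W, 2).sum()              # == 0.5 * W.norm(2)**2, without computing a sqrt
    l2_reg = sq if l2_reg is None else l2_reg + sq

batch_loss = (1 / N_train) * (y_pred - batch_ys).pow(2).sum() + reg_lambda * l2_reg
batch_loss.backward()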

Further, printing the value returned by the function, it surely seems to be heading towards inf. But almost the same logic for custom regularization works in TensorFlow. Am I doing something wrong code-wise?

Epoch: [0][0/391] Time 6.739 (6.739) Loss 3157501.5000 (3157501.5000) Prec@1 10.938 (10.938)
Variable containing:
2.7529e+08
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
4.1735e+12
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
5.0968e+27
[torch.cuda.FloatTensor of size 1 (GPU 0)]
…

Then the error.

Thanks!

I was looking for how to add the L2 norm of a parameter to the loss function. I did as suggested above; however, I get the following error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

How should I solve this problem? Thank you very much.

You are backpropagating through the same graph multiple times. Make sure that is really what you want, because in most cases you don't need to. If it is, pass retain_graph=True to the first backward() call. If not, find out where you are backpropagating more than once and fix it.
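
In this thread's setting, the usual culprit is building the regularization term once and reusing it across iterations. A sketch of both options, with a hypothetical toy model:

import torch

model = torch.nn.Linear(4, 1)      # hypothetical toy setup so the example runs on its own
criterion = torch.nn.MSELoss()
x, y = torch.randn(16, 4), torch.randn(16, 1)
reg_lambda = 1e-3

# Usual fix: rebuild the penalty (and the rest of the graph) every iteration,
# so each backward() call has a fresh graph behind it.
for step in range(3):
    l2_reg = sum(p.pow(2).sum() for p in model.parameters())
    loss = criterion(model(x), y) + reg_lambda * l2_reg
    loss.backward()
    model.zero_grad()              # (optimizer step would go here)

# Alternative: if you really do need to call backward() twice on the same graph,
# keep its buffers alive on the first call.
loss = criterion(model(x), y)
loss.backward(retain_graph=True)
loss.backward()                    # allowed now; gradients accumulate across the two calls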

Is it possible to replace

batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda

with

batch_loss = MSEloss(y_pred, batch_ys) + l2_reg * reg_lambda

Thanks!
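
One thing to keep in mind (assuming current PyTorch defaults): nn.MSELoss averages the squared error over all elements, so the replacement matches the manual (1/N_train)*sum(...) expression only when that element count equals N_train. A quick standalone check:

import torch

criterion = torch.nn.MSELoss()          # default reduction averages over all elements

y_pred = torch.randn(8, 1)
batch_ys = torch.randn(8, 1)
N = y_pred.numel()                      # plays the role of N_train here

manual = (1 / N) * (y_pred - batch_ys).pow(2).sum()
print(torch.allclose(manual, criterion(y_pred, batch_ys)))  # True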

Won’t this penalize the bias terms as well?

Yes, it will penalize the bias terms. If you want it to not penalize the bias terms, you can easily filter them out with the model.named_parameters() call, and simply not add the regularizer for the bias-named parameters.
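
A minimal sketch of that filtering, assuming a small hypothetical model (any nn.Module works the same way):

import torch

model = torch.nn.Sequential(           # hypothetical model for illustration
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
reg_lambda = 1e-2

l2_reg = None
for name, W in model.named_parameters():
    if name.endswith('bias'):
        continue                       # skip bias terms
    sq = W.pow(2).sum()
    l2_reg = sq if l2_reg is None else l2_reg + sq

# then: loss = data_loss + reg_lambda * l2_reg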

I noticed that scaling by 0.5 is also necessary:

reg_loss = None
for param in model.parameters():
    if reg_loss is None:
        reg_loss = 0.5 * torch.sum(param**2)
    else:
        reg_loss = reg_loss + 0.5 * param.norm(2)**2

loss += lmbd * reg_loss

Full code:

import torch

torch.manual_seed(1)

N, D_in, H, D_out = 10, 5, 5, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

criterion = torch.nn.MSELoss()
lr = 1e-4
weight_decay = 0  # for torch.optim.SGD
lmbd = 0.9  # for custom L2 regularization

optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

for t in range(100):
    y_pred = model(x)

    # Compute and print loss.
    loss = criterion(y_pred, y)

    optimizer.zero_grad()

    reg_loss = None
    for param in model.parameters():
        if reg_loss is None:
            reg_loss = 0.5 * torch.sum(param**2)
        else:
            reg_loss = reg_loss + 0.5 * param.norm(2)**2

    loss += lmbd * reg_loss

    loss.backward()

    optimizer.step()

for name, param in model.named_parameters():
    print(name, param)

Can I do weight normalization for conv layers as follows?

conv.weight+=lamda*(conv.weight**2)

Can you provide some intuition as to why we should scale by 0.5? As it is stated now, it is creating some confusion.

It is not necessary in general; I just noticed that with the 0.5 factor the manual penalty matches PyTorch's weight_decay implementation.
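
To make that concrete: torch.optim's weight_decay adds weight_decay * W to each gradient, and the gradient of 0.5 * lmbd * ||W||^2 is exactly lmbd * W, so the 0.5 puts the manual penalty on the same scale. A quick standalone check:

import torch

lmbd = 0.9
W = torch.randn(5, 3, requires_grad=True)

penalty = 0.5 * lmbd * W.pow(2).sum()   # 0.5 * lmbd * ||W||^2
penalty.backward()

# identical to the lmbd * W that weight_decay=lmbd would add to the gradient
print(torch.allclose(W.grad, lmbd * W.detach()))  # True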