How does one implement weight regularization (L1 or L2) manually, without using torch.optim's weight_decay?

I wanted to do it manually so I implemented it as follows:

reg_lambda=1.0
l2_reg=0
for W in mdl.parameters():
    l2_reg += W.norm(2)  # accumulate the L2 norm of every parameter
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + reg_lambda*l2_reg
## BACKWARD PASS
batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients

Is this correct? The key part I care about is that the SGD update works correctly, i.e. that the following update:

## SGD update
for W in mdl.parameters():
    delta = eta*W.grad.data
    W.data.copy_(W.data - delta) # W <- W - eta*g

picks up the 2W term from the L2 penalty.

Here is a related question: Simple L2 regularization?

It's almost correct.

l2_reg here is a Python scalar, so operations done on it are not recorded for the autograd backward().
Instead, you should make l2_reg an autograd Variable:

l2_reg = None
for W in mdl.parameters():
    if l2_reg is None:
        l2_reg = W.norm(2)
    else:
        l2_reg = l2_reg + W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
batch_loss.backward()
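
As a quick sanity check on the 2W question from the original post, here is a standalone sketch with one hypothetical parameter, on a current PyTorch version where tensors carry autograd directly: as written, W.norm(2) contributes reg_lambda * W / ||W|| to the gradient; the 2 * reg_lambda * W term only appears once the norm is squared, which comes up again further down the thread.

import torch

reg_lambda = 1.0
W = torch.randn(5, requires_grad=True)    # hypothetical single parameter

penalty = reg_lambda * W.norm(2)          # the penalty as written above: the norm itself
penalty.backward()
print(torch.allclose(W.grad, reg_lambda * W.detach() / W.detach().norm(2)))  # True

W.grad = None
penalty_sq = reg_lambda * W.norm(2) ** 2  # squared norm: now the gradient is 2 * reg_lambda * W
penalty_sq.backward()
print(torch.allclose(W.grad, 2 * reg_lambda * W.detach()))  # True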

Why not:

    l2_reg = Variable(torch.FloatTensor(1), requires_grad=True)
    for W in mdl.parameters():
        l2_reg = l2_reg + W.norm(2)

(not sure if l2_reg should be 1x1 or 1).
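
One caveat with torch.FloatTensor(1): it allocates uninitialized memory, so the accumulator may start from garbage. Zero-initializing it avoids that; a small sketch, assuming a hypothetical mdl standing in for the model above:

import torch
from torch.autograd import Variable       # old-style API, matching the thread

mdl = torch.nn.Linear(4, 2)               # hypothetical stand-in for the model

l2_reg = Variable(torch.zeros(1), requires_grad=True)   # zeros, not uninitialized memory
for W in mdl.parameters():
    l2_reg = l2_reg + W.norm(2)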

Yes, you can do this too.

I get the following error while trying this:

File “…/lib/python3.6/site-packages/torch/autograd/variable.py”, line 146, in backward
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File “…/lib/python3.6/site-packages/torch/autograd/_functions/reduce.py”, line 200, in backward
return input.mul(grad_output[0] / self.norm)
ZeroDivisionError: float division by zero

Any suggestions? (I guess it's somehow getting self.norm as 0.)

As a workaround, I just ensure the L2 norm of the weights is not 0 after initialization (which I think should be handled in the code).

Since the parameters are Variables, won’t l2_reg be automatically converted to a Variable at the end? I’m using l2_reg=0 and it seems to work.
Also, I'm not sure the OP's formula for L2 regularization is correct: you need the sum of every parameter element squared.

W.norm(2) should be W.norm(2)**2, no?

I'm missing a square: W.norm(2) should be W.norm(2)**2, no?

Yeah; but it’s probably more efficient to just do torch.pow(W, 2).sum(). Also you may want to multiply by 0.5 as a standard convention.
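
Putting those two suggestions together, the accumulation could look like this sketch (current PyTorch, with a hypothetical mdl and batch standing in for the variables above):

import torch

mdl = torch.nn.Linear(4, 1)                       # hypothetical stand-ins for the thread's variables
batch_xs, batch_ys = torch.randn(8, 4), torch.randn(8, 1)
N_train = batch_ys.numel()
reg_lambda = 1.0

y_pred = mdl(batch_xs)

l2_reg = None
for W in mdl.parameters():
    sq = 0.5 * torch.pow(W, 2).sum()              # == 0.5 * W.norm(2)**2, without computing a sqrt
    l2_reg = sq if l2_reg is None else l2_reg + sq

batch_loss = (1 / N_train) * (y_pred - batch_ys).pow(2).sum() + reg_lambda * l2_reg
batch_loss.backward()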

Further, printing the value returned by the function, it surely seems to be heading towards inf. But almost the same logic for custom regularization works in TensorFlow. Am I doing something wrong code-wise?

Epoch: [0][0/391] Time 6.739 (6.739) Loss 3157501.5000 (3157501.5000) Prec@1 10.938 (10.938)
Variable containing:
2.7529e+08
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
4.1735e+12
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
5.0968e+27
[torch.cuda.FloatTensor of size 1 (GPU 0)]
…

Then the error.

Thanks!

I was looking for how to add the L2 norm of a parameter to the loss function. I did as suggested above; however, I get the following error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

How should I solve this problem? Thank you very much.

You are backpropagating through the same graph multiple times. Make sure that is really what you want, because in most cases you don't need to. If it is, pass retain_graph=True to the first backward() call. If not, find out where you are backpropagating more than once and fix it.
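
In this thread's setting, the usual culprit is building the regularization term once and reusing it across iterations. A sketch of both options, with a hypothetical toy model:

import torch

model = torch.nn.Linear(4, 1)      # hypothetical toy setup so the example runs on its own
criterion = torch.nn.MSELoss()
x, y = torch.randn(16, 4), torch.randn(16, 1)
reg_lambda = 1e-3

# Usual fix: rebuild the penalty (and the rest of the graph) every iteration,
# so each backward() call has a fresh graph behind it.
for step in range(3):
    l2_reg = sum(p.pow(2).sum() for p in model.parameters())
    loss = criterion(model(x), y) + reg_lambda * l2_reg
    loss.backward()
    model.zero_grad()              # (optimizer step would go here)

# Alternative: if you really do need to call backward() twice on the same graph,
# keep its buffers alive on the first call.
loss = criterion(model(x), y)
loss.backward(retain_graph=True)
loss.backward()                    # allowed now; gradients accumulate across the two calls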

Is it possible to replace

batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda

with

batch_loss = MSEloss(y_pred, batch_ys) + l2_reg * reg_lambda

Thanks!
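
One thing to keep in mind (assuming current PyTorch defaults): nn.MSELoss averages the squared error over all elements, so the replacement matches the manual (1/N_train)*sum(...) expression only when that element count equals N_train. A quick standalone check:

import torch

criterion = torch.nn.MSELoss()          # default reduction averages over all elements

y_pred = torch.randn(8, 1)
batch_ys = torch.randn(8, 1)
N = y_pred.numel()                      # plays the role of N_train here

manual = (1 / N) * (y_pred - batch_ys).pow(2).sum()
print(torch.allclose(manual, criterion(y_pred, batch_ys)))  # True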

Won’t this penalize the bias terms as well?

Yes, it will penalize the bias terms. If you want it to not penalize the bias terms, you can easily filter them out with the model.named_parameters() call, and simply not add the regularizer for the bias-named parameters.
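
A minimal sketch of that filtering, assuming a small hypothetical model (any nn.Module works the same way):

import torch

model = torch.nn.Sequential(           # hypothetical model for illustration
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
reg_lambda = 1e-2

l2_reg = None
for name, W in model.named_parameters():
    if name.endswith('bias'):
        continue                       # skip bias terms
    sq = W.pow(2).sum()
    l2_reg = sq if l2_reg is None else l2_reg + sq

# then: loss = data_loss + reg_lambda * l2_reg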

I noticed that scaling by 0.5 is also necessary:

reg_loss = None
for param in model.parameters():
    if reg_loss is None:
        reg_loss = 0.5 * torch.sum(param**2)
    else:
        reg_loss = reg_loss + 0.5 * param.norm(2)**2

loss += lmbd * reg_loss

Full code:

import torch

torch.manual_seed(1)

N, D_in, H, D_out = 10, 5, 5, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

criterion = torch.nn.MSELoss()
lr = 1e-4
weight_decay = 0  # for torch.optim.SGD
lmbd = 0.9  # for custom L2 regularization

optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

for t in range(100):
    y_pred = model(x)

    # Compute and print loss.
    loss = criterion(y_pred, y)

    optimizer.zero_grad()

    reg_loss = None
    for param in model.parameters():
        if reg_loss is None:
            reg_loss = 0.5 * torch.sum(param**2)
        else:
            reg_loss = reg_loss + 0.5 * param.norm(2)**2

    loss += lmbd * reg_loss

    loss.backward()

    optimizer.step()

for name, param in model.named_parameters():
    print(name, param)

Can I do weight normalization for conv layers as follows?

conv.weight+=lamda*(conv.weight**2)

Can you provide some intuition as to why we should scale by 0.5? As it is stated now, it is creating some confusion.

It is not necessary in general; I just noticed that with the 0.5 factor the manual penalty matches PyTorch's weight_decay implementation.
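
To make that concrete: torch.optim's weight_decay adds weight_decay * W to each gradient, and the gradient of 0.5 * lmbd * ||W||^2 is exactly lmbd * W, so the 0.5 puts the manual penalty on the same scale. A quick standalone check:

import torch

lmbd = 0.9
W = torch.randn(5, 3, requires_grad=True)

penalty = 0.5 * lmbd * W.pow(2).sum()   # 0.5 * lmbd * ||W||^2
penalty.backward()

# identical to the lmbd * W that weight_decay=lmbd would add to the gradient
print(torch.allclose(W.grad, lmbd * W.detach()))  # True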