# How does one implement Weight regularization (l1 or l2) manually without optimum?

I wanted to do it manually so I implemented it as follows:

``````
reg_lambda = 1.0
l2_reg = 0
for W in mdl.parameters():
    l2_reg += W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + reg_lambda*l2_reg
## BACKWARD PASS
batch_loss.backward()  # Use autograd to compute the backward pass. Now W will have gradients
``````

Is this correct? The key part I care about is that the SGD update works correctly, i.e.:

``````
## SGD update
for W in mdl.parameters():
    W.data.copy_(W.data - delta)  # W - eta*g + A*gdl_eps
``````

In other words, I want the SGD update to include the `2w` term.
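
For reference, a sketch of where that term comes from, assuming the penalty is the squared norm $\lambda\lVert W\rVert_2^2$, with $\eta$ the learning rate and $L$ the data loss (symbols chosen here for illustration, not taken from the post):

```latex
W \leftarrow W - \eta\left(\nabla_W L + 2\lambda W\right)
```

With the unsquared `W.norm(2)` used in the snippet above, the penalty gradient is $W/\lVert W\rVert_2$ instead, so the update would not contain a `2w` term.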

Here is a related question: Simple L2 regularization?


It's almost correct.

`l2_reg` here is a python scalar, so operations done on it are not recorded for the autograd backward().

``````
l2_reg = None
for W in mdl.parameters():
    if l2_reg is None:
        l2_reg = W.norm(2)
    else:
        l2_reg = l2_reg + W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
batch_loss.backward()
``````

Why not:

``````
# torch.zeros(1) is used instead of torch.FloatTensor(1), which returns uninitialized memory
l2_reg = Variable(torch.zeros(1), requires_grad=True)
for W in mdl.parameters():
    l2_reg = l2_reg + W.norm(2)
``````

(not sure if `l2_reg` should be 1x1 or 1).


Yes, you can do this too.


I get the following error while trying this:

File "…/lib/python3.6/site-packages/torch/autograd/variable.py", line 146, in backward
File "…/lib/python3.6/site-packages/torch/autograd/_functions/reduce.py", line 200, in backward
ZeroDivisionError: float division by zero

Any suggestions? (I guess it's somehow getting self.norm as 0.)

As a workaround, I just ensure the l2 norm of the weights is not 0 after initialization (which should be handled in the code I think).
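
A possible explanation, offered as an assumption since the traceback is truncated: the gradient of the unsquared norm divides by the norm itself, which is undefined when every entry of `W` is zero, while the squared form involves no division:

```latex
\frac{\partial}{\partial W}\lVert W\rVert_2 = \frac{W}{\lVert W\rVert_2},
\qquad
\frac{\partial}{\partial W}\,\tfrac{1}{2}\lVert W\rVert_2^2 = W
```

Under that reading, penalizing the squared norm would avoid the problem without constraining the initialization.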

Since the parameters are Variables, won't l2_reg be automatically converted to a Variable at the end? I'm using l2_reg=0 and it seems to work.
Also I'm not sure if OP's formula for L2 reg is correct. You need the sum of every parameter element squared.
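
A minimal sketch to check this, assuming a recent PyTorch release (0.4 or later, where `Variable` and `Tensor` are merged). The model `mdl` here is a hypothetical stand-in for the one in the thread. It starts from a plain Python `0` and uses the sum of squared elements as the penalty:

```python
import torch

torch.manual_seed(0)
mdl = torch.nn.Linear(3, 1)  # hypothetical stand-in for the thread's `mdl`

l2_reg = 0  # plain Python number
for W in mdl.parameters():
    # number + tensor yields a tensor, so autograd tracks the running sum
    l2_reg = l2_reg + W.pow(2).sum()

l2_reg.backward()  # gradient of sum(w**2) is 2*w for every parameter
```

After `backward()`, each parameter's `.grad` should equal twice the parameter, which is the `2w` term asked about in the original question.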


W.norm(2) should be W.norm(2)**2, no?


Yeah; but it's probably more efficient to just do `torch.pow(W, 2).sum()`. Also you may want to multiply by 0.5 as a standard convention.
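
As a quick sanity check, here is a small sketch showing that the two expressions compute the same value. The tensor `W` is made-up example data, not from the thread:

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)  # made-up weight matrix for illustration

squared_norm = W.norm(2) ** 2          # norm first, then square
sum_of_squares = torch.pow(W, 2).sum()  # square elementwise, then sum

# Optional 0.5 factor, following the convention mentioned above
penalty = 0.5 * sum_of_squares
```

The second form skips the square root that `norm` computes internally, which is the efficiency point made above.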


Further, printing the value returned by the function, it surely seems to be going towards INF. But almost the same logic for custom regularization works in TensorFlow. Am I doing something wrong code-wise?

Epoch: [0][0/391] Time 6.739 (6.739) Loss 3157501.5000 (3157501.5000) Prec@1 10.938 (10.938)
Variable containing:
2.7529e+08
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
4.1735e+12
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
5.0968e+27
[torch.cuda.FloatTensor of size 1 (GPU 0)]
…

Then the error.

Thanks!

I was looking for how to add an L2 norm of a parameter to the loss function. I did as suggested above, but I get this error:

``````
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
``````

How should I solve this problem? Thank you very much.

You are backpropagating through the same graph multiple times. Make sure that this is what you want, because in most cases you don't need to. If it is, specify the flag `retain_graph`. If not, find out where you are backpropagating more than once and fix it.
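
One common cause (an assumption about this case, since the poster's code is not shown) is building the regularization term once, outside the training loop, and reusing it in every iteration. A minimal sketch of the usual fix, with a hypothetical model and made-up data, rebuilds the penalty inside the loop:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                 # hypothetical model
x, y = torch.randn(8, 4), torch.randn(8, 1)   # made-up data
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
reg_lambda = 0.1

losses = []
for step in range(3):
    optimizer.zero_grad()
    # Rebuild the penalty every iteration so each backward() sees a fresh graph
    l2_reg = sum(p.pow(2).sum() for p in model.parameters())
    loss = torch.nn.functional.mse_loss(model(x), y) + reg_lambda * l2_reg
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because the penalty is recomputed from the current parameters each step, no `retain_graph=True` is needed.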


Is it possible to replace

``````
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
``````

with

`batch_loss = MSEloss(y_pred, batch_ys) + l2_reg * reg_lambda`

Thanks!
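
A small sketch comparing the two data-loss terms, using made-up tensors and assuming `MSEloss` refers to an instance of `torch.nn.MSELoss` with its default `reduction='mean'`:

```python
import torch

torch.manual_seed(0)
N_train = 6
y_pred = torch.randn(N_train, 1)    # made-up predictions, one output per sample
batch_ys = torch.randn(N_train, 1)  # made-up targets

manual = (1 / N_train) * (y_pred - batch_ys).pow(2).sum()
builtin = torch.nn.MSELoss()(y_pred, batch_ys)  # averages over all elements
```

The two agree here because each sample has a single output. With more than one output per sample, `MSELoss` averages over every element rather than over samples, so the values would differ by a constant factor.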


Won't this penalize the bias terms as well?

Yes, it will penalize the bias terms. If you don't want that, you can easily filter them out by iterating over `model.named_parameters()` and skipping the regularizer for the parameters named `bias`.
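
A minimal sketch of that filtering, assuming a small hypothetical `Sequential` model whose bias parameters have names ending in `bias` (as is the case for `torch.nn.Linear`):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(   # hypothetical model for illustration
    torch.nn.Linear(5, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)

reg_loss = 0.0
for name, param in model.named_parameters():
    if name.endswith("bias"):
        continue  # skip bias terms, penalize weights only
    reg_loss = reg_loss + 0.5 * param.pow(2).sum()
```

Custom modules may name their parameters differently, so it is worth printing the names from `named_parameters()` once to confirm the filter matches what you expect.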


I noticed that scaling by 0.5 is also needed (to match the result of PyTorch's built-in `weight_decay`):

``````
reg_loss = None
for param in model.parameters():
    if reg_loss is None:
        reg_loss = 0.5 * torch.sum(param**2)
    else:
        reg_loss = reg_loss + 0.5 * param.norm(2)**2

loss += lmbd * reg_loss
``````

Full code:

``````
import torch

torch.manual_seed(1)

N, D_in, H, D_out = 10, 5, 5, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

criterion = torch.nn.MSELoss()
lr = 1e-4
weight_decay = 0  # for torch.optim.SGD
lmbd = 0.9  # for custom L2 regularization

optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

for t in range(100):
    y_pred = model(x)

    # Compute the data loss.
    loss = criterion(y_pred, y)

    # Custom L2 penalty: 0.5 * sum of squared parameter elements.
    reg_loss = None
    for param in model.parameters():
        if reg_loss is None:
            reg_loss = 0.5 * torch.sum(param**2)
        else:
            reg_loss = reg_loss + 0.5 * param.norm(2)**2

    loss += lmbd * reg_loss

    # Clear old gradients; otherwise they accumulate across iterations.
    optimizer.zero_grad()
    loss.backward()

    optimizer.step()

for name, param in model.named_parameters():
    print(name, param)
``````

Can I do weight normalization for conv layers as follows?

`conv.weight += lamda*(conv.weight**2)`

Can you provide some intuition as to why we should scale by 0.5? As it is stated now it is creating some confusion.

It is not necessary in general; I noticed that with the 0.5 factor the result is compatible with PyTorch's own implementation.
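
A sketch of that compatibility check, assuming plain `torch.optim.SGD` (no momentum), which adds `weight_decay * p` to each parameter's gradient. The model, data, and hyperparameter values are made up for illustration. One step with built-in weight decay is compared against one step with the manual `0.5 * lmbd * sum(p**2)` penalty:

```python
import copy
import torch

torch.manual_seed(0)
x, y = torch.randn(10, 5), torch.randn(10, 1)  # made-up data
lr, lmbd = 0.1, 0.9

model_a = torch.nn.Linear(5, 1)       # hypothetical model
model_b = copy.deepcopy(model_a)      # identical starting weights

# (a) built-in weight decay
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=lmbd)
loss_a = torch.nn.functional.mse_loss(model_a(x), y)
loss_a.backward()
opt_a.step()

# (b) manual penalty with the 0.5 factor, no built-in weight decay
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
reg = sum(0.5 * p.pow(2).sum() for p in model_b.parameters())
loss_b = torch.nn.functional.mse_loss(model_b(x), y) + lmbd * reg
loss_b.backward()
opt_b.step()

# Largest absolute difference between the two sets of updated parameters
max_diff = max(
    (pa - pb).abs().max().item()
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
```

The 0.5 cancels the factor of 2 from differentiating the square, so the penalty gradient is `lmbd * p` in both cases. Without it, the manual version would behave like `weight_decay = 2 * lmbd`.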