How does one implement weight regularization (L1 or L2) manually, without using the optimizer?


(MirandaAgent) #1

I wanted to do it manually so I implemented it as follows:

reg_lambda = 1.0
l2_reg = 0
for W in mdl.parameters():
    l2_reg += W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + reg_lambda*l2_reg
## BACKWARD PASS
batch_loss.backward() # Use autograd to compute the backward pass. Now W will have gradients

Is this correct? The key part I care about is that the SGD update works correctly, i.e.:

## SGD update
for W in mdl.parameters():
    delta = eta*W.grad.data
    W.data.copy_(W.data - delta) # W - eta*g + A*gdl_eps

has the 2W term in the SGD update.
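
As a quick editorial check of that claim (assuming the penalty is the squared norm reg_lambda * ||W||**2, which, as later replies point out, is what actually produces the 2W term):

import torch

reg_lambda = 1.0
W = torch.randn(3, 3, requires_grad=True)   # hypothetical parameter tensor

# The gradient of reg_lambda * ||W||^2 w.r.t. W is 2 * reg_lambda * W, which
# is the "2W" term that ends up in W.grad and hence in the SGD update.
penalty = reg_lambda * W.norm(2) ** 2
penalty.backward()
print(torch.allclose(W.grad, 2 * reg_lambda * W))  # True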

Here is a related question: Simple L2 regularization?


#2

It’s almost correct.

l2_reg here is a Python scalar, so operations done on it are not recorded for the autograd backward().
Instead, you should make l2_reg an autograd Variable.

l2_reg = None
for W in mdl.parameters():
    if l2_reg is None:
        l2_reg = W.norm(2)
    else:
        l2_reg = l2_reg + W.norm(2)
batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda
batch_loss.backward()

(MirandaAgent) #3

Why not:

    l2_reg = Variable( torch.FloatTensor(1), requires_grad=True)
    for W in mdl.parameters():
        l2_reg = l2_reg + W.norm(2)

(not sure if l2_reg should be 1x1 or 1).
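
A minimal sketch of the same accumulation in current PyTorch, where Variable has been merged into Tensor (the model is a hypothetical stand-in). Whether the accumulator is 1x1, shape (1,), or a 0-dim scalar, and whether it has requires_grad=True, does not change the result, since the gradient flows through the W.norm(2) terms:

import torch
import torch.nn as nn

mdl = nn.Linear(4, 2)  # hypothetical stand-in for the model in the thread

l2_reg = torch.zeros(1)          # zero accumulator; requires_grad not needed
for W in mdl.parameters():
    l2_reg = l2_reg + W.norm(2)  # each `+` builds a new autograd-tracked tensor

l2_reg.sum().backward()          # gradients now populate W.grad for every parameter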


#4

Yes, you can do this too.


#5

I get the following error while trying this:

File “…/lib/python3.6/site-packages/torch/autograd/variable.py”, line 146, in backward
self._execution_engine.run_backward((self,), (gradient,), retain_variables)
File “…/lib/python3.6/site-packages/torch/autograd/_functions/reduce.py”, line 200, in backward
return input.mul(grad_output[0] / self.norm)
ZeroDivisionError: float division by zero

Any suggestions? (I guess it’s somehow getting self.norm as 0.)


#6

As a workaround, I just ensure the L2 norm of the weights is not 0 after initialization (which I think should be handled in the code).
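
An editorial note on why the workaround helps: the backward pass of ||W||_2 divides by the norm itself (as the traceback above shows), so an all-zero parameter hits the division by zero. The squared form suggested later in the thread avoids the division entirely, since its gradient is simply 2*W:

import torch

W = torch.zeros(3, requires_grad=True)  # hypothetical all-zero parameter

W.pow(2).sum().backward()  # gradient is 2*W, no division by the norm
print(W.grad)              # tensor([0., 0., 0.])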


#7

Since the parameters are Variables, won’t l2_reg be automatically converted to a Variable at the end? I’m using l2_reg = 0 and it seems to work.
Also, I’m not sure the OP’s formula for L2 regularization is correct. You need the sum of every parameter element squared.


(MirandaAgent) #8

W.norm(2) should be W.norm(2)**2, no?


(MirandaAgent) #9

I think I’m missing a square: W.norm(2) should be W.norm(2)**2, no?


#10

Yeah; but it’s probably more efficient to just do torch.pow(W, 2).sum(). Also you may want to multiply by 0.5 as a standard convention.
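
For reference, a small editorial check (with a hypothetical tensor) that the two forms agree: the sum of squared elements equals the squared 2-norm, and 0.5 is just the conventional factor that makes the gradient W rather than 2*W.

import torch

W = torch.randn(4, 3)  # hypothetical weight tensor

a = 0.5 * torch.pow(W, 2).sum()   # sum of squared elements, scaled by 0.5
b = 0.5 * W.norm(2) ** 2          # squared 2-norm (Frobenius), scaled by 0.5
print(torch.allclose(a, b))       # True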


(Nitin Kumar Bansal) #12

Further, printing the value returned by the function, it surely seems to be going towards INF. But almost the same logic for custom regularization works in TensorFlow. Am I doing something wrong code-wise?

Epoch: [0][0/391] Time 6.739 (6.739) Loss 3157501.5000 (3157501.5000) Prec@1 10.938 (10.938)
Variable containing:
2.7529e+08
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
4.1735e+12
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Variable containing:
5.0968e+27
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Then the error.

Thanks!


#14

I was looking for how to add the L2 norm of a parameter to the loss function. I did as suggested above; however, I get the following error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

How should I solve this problem? Thank you very much.


(Simon Wang) #15

You are backpropagating through the same graph multiple times. Make sure that is what you want, because in most cases you don’t need to. If so, specify the retain_graph=True flag. If not, find out where you are backpropagating more than once and fix it.
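
One common cause in this thread’s setting (an editorial sketch, not part of the reply above) is building the regularization term once outside the training loop and adding it to the loss on every iteration; rebuilding it each iteration keeps every backward() call on a fresh graph.

import torch
import torch.nn as nn

# Hypothetical setup mirroring the thread's snippets.
mdl = nn.Linear(5, 1)
x, y = torch.randn(10, 5), torch.randn(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(mdl.parameters(), lr=1e-2)
reg_lambda = 0.01

for epoch in range(3):
    y_pred = mdl(x)
    # Rebuild the regularization term inside the loop so each backward()
    # runs on a freshly built graph; reusing a term built once outside the
    # loop is what typically triggers this error.
    l2_reg = sum(W.pow(2).sum() for W in mdl.parameters())
    loss = criterion(y_pred, y) + reg_lambda * l2_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()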


(com) #16

Is it possible to replace

batch_loss = (1/N_train)*(y_pred - batch_ys).pow(2).sum() + l2_reg * reg_lambda

with

batch_loss = MSEloss(y_pred, batch_ys) + l2_reg * reg_lambda

Thanks!
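
This isn’t answered above, but a hedged sketch of that replacement (hypothetical shapes; l2_reg stands in for the accumulated penalty): instantiate nn.MSELoss and then call it on the tensors. Note that nn.MSELoss() averages over all elements by default, while the manual formula divides the sum by N_train, so the normalization may differ.

import torch
import torch.nn as nn

y_pred = torch.randn(8, 1, requires_grad=True)   # hypothetical predictions
batch_ys = torch.randn(8, 1)                     # hypothetical targets
l2_reg = torch.tensor(0.3)                       # placeholder for the accumulated penalty
reg_lambda = 0.01

criterion = nn.MSELoss()                                        # create the loss module...
batch_loss = criterion(y_pred, batch_ys) + l2_reg * reg_lambda  # ...then call it
batch_loss.backward()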


#17

Won’t this penalize the bias terms as well?


#18

Yes, it will penalize the bias terms. If you don’t want it to penalize the bias terms, you can easily filter them out using the model.named_parameters() call and not invoke the regularizer for the bias-named terms.
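
A minimal sketch of that filtering, assuming the standard PyTorch naming where bias parameters have "bias" in their name (the model is a hypothetical stand-in):

import torch
import torch.nn as nn

mdl = nn.Linear(4, 2)  # hypothetical stand-in for the model

l2_reg = 0.0
for name, W in mdl.named_parameters():
    if 'bias' in name:
        continue  # skip bias terms so they are not penalized
    l2_reg = l2_reg + W.pow(2).sum()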


(Agajan Torayev) #19

I noticed that scaling by 0.5 is also needed:

reg_loss = None
for param in model.parameters():
    if reg_loss is None:
        reg_loss = 0.5 * torch.sum(param**2)
    else:
        reg_loss = reg_loss + 0.5 * param.norm(2)**2

loss += lmbd * reg_loss

Full code:

import torch

torch.manual_seed(1)

N, D_in, H, D_out = 10, 5, 5, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

criterion = torch.nn.MSELoss()
lr = 1e-4
weight_decay = 0  # for torch.optim.SGD
lmbd = 0.9  # for custom L2 regularization

optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)

for t in range(100):
    y_pred = model(x)

    # Compute and print loss.
    loss = criterion(y_pred, y)

    optimizer.zero_grad()

    reg_loss = None
    for param in model.parameters():
        if reg_loss is None:
            reg_loss = 0.5 * torch.sum(param**2)
        else:
            reg_loss = reg_loss + 0.5 * param.norm(2)**2

    loss += lmbd * reg_loss

    loss.backward()

    optimizer.step()

for name, param in model.named_parameters():
    print(name, param)