Adding L1/L2 regularization in a convolutional network in PyTorch?

I am new to PyTorch and would like to add L1 regularization after a layer of a convolutional network, but I do not know how to do that.

The architecture of my network is defined as follows:

downconv = nn.Conv2d(outer_nc, inner_nc, kernel_size=4,
                         stride=2, padding=1, bias=use_bias)
downrelu = nn.LeakyReLU(0.2, True)
downnorm = norm_layer(inner_nc)
uprelu = nn.ReLU(True)
upnorm = norm_layer(outer_nc)
upconv = nn.ConvTranspose2d(inner_nc, outer_nc,
                                    kernel_size=4, stride=2,
                                    padding=1, bias=use_bias)
down = [downrelu, downconv]
up = [uprelu, upconv, upnorm]
model = down + up

And I used the following code to implement the regularizer:

import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable

xx = nn.Parameter(torch.from_numpy(np.ones((3, 3))))
target = Variable(torch.from_numpy(np.zeros((3, 3))))
l1_crit = nn.L1Loss()
l1_crit(xx, target)

l1_crit = nn.L1Loss(size_average=False)
reg_loss = 0
for param in model:
    print("PARAM: ", param)
    reg_loss += l1_crit(param)

model = down + [submodule] + up + [nn.Dropout(0.5)]

But I get the following error:
TypeError: forward() takes exactly 3 arguments (2 given)

I think the problem is in my code that implements the L1 regularization. Can someone help me?

In reg_loss += l1_crit(param) you are passing only the input tensor; nn.L1Loss expects both an input and a target, which is why forward() complains about a missing argument.
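
A minimal sketch of the corrected call (the stand-in layer list below is illustrative, not the full model from the question):

import torch
import torch.nn as nn

# Stand-in for the layer list from the question.
model = [nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2, True)]

l1_crit = nn.L1Loss(reduction='sum')   # size_average=False is deprecated; reduction='sum' is the equivalent
reg_loss = 0.0
for module in model:                   # iterating the list yields modules, not parameters,
    for param in module.parameters():  # so iterate each module's parameters
        reg_loss = reg_loss + l1_crit(param, torch.zeros_like(param))

print(reg_loss)  # sum of |w| over all weights and biases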

You could implement L1 regularization the same way L2 regularization is usually written by hand: sum a norm over the weights and add the scaled sum to the loss.
For L1 regularization, change W.norm(2) to W.norm(p=1).
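
As a minimal sketch of that pattern (the toy model, lambda_l1, and the random data below are placeholders, not from the original post):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                      nn.Flatten(),
                      nn.Linear(8 * 8 * 8, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lambda_l1 = 1e-4  # regularization strength (hypothetical value)

inputs = torch.randn(4, 3, 8, 8)
targets = torch.randint(0, 10, (4,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
# L1 penalty: W.norm(p=1) instead of W.norm(2) for each parameter
l1_penalty = sum(W.norm(p=1) for W in model.parameters())
(loss + lambda_l1 * l1_penalty).backward()
optimizer.step()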

Since the L1 regularizer is not differentiable everywhere, what does PyTorch do when it has to differentiate this function? A simple example shows that PyTorch returns zero:

import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)
y = torch.abs(x)
y[2].backward()

print(x.grad)
# tensor([-0., -0., 0., 0., 0.])

Why is this the case?

I think a zero gradient is expected for a zero input, and it fits the idea of a regularizer, since no penalty should be added to a value that is already at zero. All other values get valid gradients:

import torch

x = torch.linspace(-1.0, 1.0, 5, requires_grad=True)
y = torch.norm(x, 1)
y.backward()

print(x)
# tensor([-1.0000, -0.5000,  0.0000,  0.5000,  1.0000], requires_grad=True)
print(x.grad)
# tensor([-1., -1.,  0.,  1.,  1.])

That makes sense. Is PyTorch using a specific algorithm to compute the gradient at this non-differentiable point? Is there an academic reference that discusses this behaviour that I can read?

The derivative is implemented here, in derivatives.yaml, as:

self: grad * self.sgn()

and with:

torch.tensor(0.).sgn()
# tensor(0.)
torch.tensor(1.).sgn()
# tensor(1.)
torch.tensor(-1.).sgn()
# tensor(-1.)

would then return a zero.
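
A quick check of that formula (just an illustration): with an upstream gradient of one, x.grad should match x.sgn() exactly.

import torch

x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.abs(x).sum().backward()  # upstream gradient is 1 for every element

print(x.grad)            # tensor([-1.,  0.,  1.])
print(x.detach().sgn())  # tensor([-1.,  0.,  1.]) -- identical, i.e. grad * self.sgn()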
I don’t know why it was defined this way, but based on e.g. this answer, the zero output might simply be “convenient” for users.
