How to change the loss function for different layers

Hi,

I am trying to reproduce the paper Continuous Learning in Single-Incremental-Task Scenarios.

One of the algorithms in the paper trains the final layer with respect to a standard loss function (e.g. cross-entropy), but trains all preceding layers with a regularization term added to the back-propagated loss. I know that adding a regularization term for the entire network is done by simply adding the term to the loss value before the optimization step:

loss += regularization_term
loss.backward()
opt.step()
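
For concreteness, regularization_term above could be something like a plain L2 penalty over every parameter (a minimal sketch; the toy model, lam and the penalty itself are just made up for illustration):

import torch
import torch.nn as nn

# Toy setup, only so the snippet is self-contained
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 10)
target = torch.randint(0, 2, (4,))

output = model(x)
loss = criterion(output, target)

# L2 penalty built from *all* parameters of the network
lam = 1e-3
regularization_term = lam * sum(p.pow(2).sum() for p in model.parameters())

loss = loss + regularization_term
loss.backward()
opt.step()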

but according to my understanding that modifies the loss for the entire network, so I am stuck on how to apply the extra term only to the layers before the final one.

Any advice or help on this matter would be very welcome.


Not necessarily. The regularization term will only modify the parameters which are in its computation graph.
Have a look at this dummy example:

import torch
import torch.nn as nn

# Simple two-layer model so we can inspect each layer's gradients separately
model = nn.Sequential(
    nn.Linear(1, 1, bias=False),
    nn.Sigmoid(),
    nn.Linear(1, 1, bias=False)
)

criterion = nn.MSELoss()
x = torch.randn(1, 1)
target = torch.ones(1, 1)

# Baseline: backpropagate the plain loss
output = model(x)
loss = criterion(output, target)
loss.backward()

lin1_grad = model[0].weight.grad.clone()
lin2_grad = model[2].weight.grad.clone()
print('Before regularization')
print('Grad lin1: {}'.format(lin1_grad))
print('Grad lin2: {}'.format(lin2_grad))

# Add regularization to lin1
model.zero_grad()
output = model(x)
loss = criterion(output, target)
loss = loss + torch.norm(model[0].weight)
loss.backward()

lin1_grad_reg = model[0].weight.grad.clone()
lin2_grad_reg = model[2].weight.grad.clone()
print('After regularization')
print('Grad lin1: {}'.format(lin1_grad_reg))
print('Grad lin2: {}'.format(lin2_grad_reg))

As you can see, the gradient for lin2 stays the same, while the gradient for lin1 changes.
This is because the parameters of lin2 were not involved in creating the regularization term, so they won't be touched by it.
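
Applied to your use case, you could build the regularization term only from the parameters of all layers except the final one, so that the extra gradient never reaches the final layer. A minimal sketch reusing the dummy model above (the L2 penalty and its weighting are just placeholders for the actual term from the paper):

# Build the penalty from every parameter except those of the final layer (model[2])
final_params = {id(p) for p in model[2].parameters()}
reg = sum(p.pow(2).sum() for p in model.parameters() if id(p) not in final_params)

model.zero_grad()
output = model(x)
loss = criterion(output, target)
loss = loss + 1e-3 * reg
loss.backward()  # the extra term only contributes gradients to the earlier layers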


Thanks!
Wow! I had no idea that the loss would retain knowledge of which layers' parameters were used to compute it. PyTorch's computation graphs are quite amazing!

Thanks again!