Freezing parts of the model by passing only certain parameters when creating the optimizer?

Hi! I’m trying to freeze parts of my model, and I tried the approach of passing only certain parameters when defining the optimizer. It seems to work, but I couldn’t find this method anywhere else on the web, so I’m not sure whether it is OK to use. The code is below; any help would be greatly appreciated!

import torch
import torch.nn as nn
import torch.optim as optim

class mymodel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 5)
        self.layer3 = nn.Linear(5, 5)
        self.act = nn.Sigmoid()

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.act(x)
        return x

model = mymodel()
optimizer = optim.Adam(model.parameters(), lr=0.001)

When I define the optimizer like this, I find that layers 1, 2, and 3 all end up with changed weight values, whereas defining it like:

finetune_1_optimizer = optim.Adam(model.layer2.parameters(), lr=0.001)

only changes layer2’s weight values (even though requires_grad is still True for all of the layers).

Is this method of freezing parts of the model OK? I thought it wouldn’t work, because the computation graph is still linked: even if I (for example) only put layer2 into the optimizer, when backprop occurs the gradients of layers 1 through 3 will all be affected, meaning that all of them could change.

Hi Kore!

Yes, this is a sound way to freeze the weights of the other layers.

It is true that calling .backward() on your loss will cause the gradients of
all of the layers to be populated.

But populating the gradients doesn’t, in and of itself, cause the weights
to change. It’s the optimizer that changes the weights, and because
finetune_1_optimizer has only had layer2’s weights added to it,
calling finetune_1_optimizer.step() will only modify layer2’s weights
and will leave layer1 and layer3 unchanged.
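
For example, a quick check along these lines (just a sketch, reusing your mymodel with an arbitrary random input and a dummy sum() loss) shows that backward() populates every layer’s gradients while step() only touches layer2:

import copy

model = mymodel()
finetune_1_optimizer = optim.Adam(model.layer2.parameters(), lr=0.001)

before = copy.deepcopy(model.state_dict())        # snapshot of all weights

loss = model(torch.randn(4, 10)).sum()            # arbitrary dummy loss
loss.backward()

print(model.layer1.weight.grad is not None)       # True  -- gradient was populated
print(model.layer3.weight.grad is not None)       # True  -- gradient was populated

finetune_1_optimizer.step()

print(torch.equal(model.layer1.weight, before['layer1.weight']))   # True  -- unchanged
print(torch.equal(model.layer2.weight, before['layer2.weight']))   # False -- updated
print(torch.equal(model.layer3.weight, before['layer3.weight']))   # True  -- unchanged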

Note, this approach has two consequences:

You will pay the price (which may or may not matter) of backpropagating
through layer1 and populating its gradients. (You do need to backpropagate
through layer3 in order to compute the gradients for layer2.)
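
(Just as an aside, and only a sketch, not something from your original code: if that cost ever matters, one common way to avoid computing layer1’s gradients at all is to turn off requires_grad for its parameters:)

model = mymodel()
for p in model.layer1.parameters():
    p.requires_grad_(False)                       # autograd will not compute layer1's gradients

loss = model(torch.randn(4, 10)).sum()
loss.backward()

print(model.layer1.weight.grad)                   # None  -- no gradient computed for layer1
print(model.layer2.weight.grad is not None)       # True  -- layer2's gradient is still computed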

loss.backward() accumulates gradients into all three layers, but if you
call finetune_1_optimizer.zero_grad(), it will only zero the gradients
of layer2, so you could repeatedly accumulate into the gradients of the
other two layers without ever zeroing them out. Since those gradients are
never used to update anything, this is unlikely to be an issue (at worst
they could eventually overflow, which shouldn’t matter anyway). You can
call model.zero_grad() (or the original full-model optimizer.zero_grad())
to zero out all three layers’ gradients.
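
To illustrate (again just a sketch using the toy mymodel and a dummy loss):

model = mymodel()
finetune_1_optimizer = optim.Adam(model.layer2.parameters(), lr=0.001)

for _ in range(3):
    finetune_1_optimizer.zero_grad()              # only clears layer2's gradients
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    finetune_1_optimizer.step()

# layer1's gradient has been accumulated over all three backward() calls
print(model.layer1.weight.grad.abs().max())

model.zero_grad()                                 # clears the gradients of all three layers
print(model.layer1.weight.grad)                   # None (or zeros, depending on your pytorch
                                                  # version's set_to_none default)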

As an aside, the way your example mymodel is written – without any
intervening nonlinearities between your three Linear layers – those
three layers collapse, in effect, into a single Linear (10, 5) layer. In
a real model you would want some sort of activation, ReLU or Sigmoid,
say, in between the Linear layers.
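
For example (just a sketch, with ReLU chosen arbitrarily):

class mymodel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 5)
        self.layer2 = nn.Linear(5, 5)
        self.layer3 = nn.Linear(5, 5)
        self.act = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.layer1(x))            # nonlinearity between the Linear layers
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)
        x = self.act(x)
        return x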

Best.

K Frank


Thank you for your answer @KFrank ! I have a further question, if you don’t mind: I found that when freezing the model as in my original question, some of the (supposedly frozen) values, such as the running_var of a BatchNorm layer I had set as frozen, still changed. This seems like it could be problematic, since I do not want the parameters of the BatchNorm to change during evaluation. In this case, should I switch those batch norm layers to eval mode? (I was going to optimize the code by leaving only the unfrozen layers in .train() and putting the rest in .eval() anyway, but I’m just curious whether my understanding is correct! :slight_smile: )
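
(For concreteness, this is roughly what I had in mind; frozen_part is just a placeholder name for whichever submodule I froze:)

model.train()                                     # unfrozen layers stay in training mode
for m in frozen_part.modules():                   # frozen_part: placeholder for the frozen submodule
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.eval()                                  # stops running_mean / running_var from updating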