Correct way to freeze layers

I have some confusion regarding the correct way to freeze layers.
Suppose I have the following NN: layer1, layer2, layer3
I want to freeze the weights of layer2, and only update layer1 and layer3.
Based on other threads, I am aware of the following ways of achieving this goal.

Method 1:

  • optim = {layer1, layer3}
  • compute loss
  • loss.backward()
  • optim.step()
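A minimal sketch of Method 1, assuming a toy three-layer `nn.Sequential` (the layer sizes and names are my own illustrative choices):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(4, 4),  # layer1
    nn.Linear(4, 4),  # layer2 (to be frozen)
    nn.Linear(4, 1),  # layer3
)

# Pass only layer1 and layer3 parameters to the optimizer;
# layer2 never appears in any param group, so step() cannot touch it.
params = list(net[0].parameters()) + list(net[2].parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

before = net[1].weight.clone()
loss = net(torch.randn(8, 4)).sum()
loss.backward()   # layer2 still accumulates a gradient here...
optimizer.step()  # ...but the optimizer never updates it

assert torch.equal(net[1].weight, before)
```

Note that `loss.backward()` still computes and stores gradients for layer2; they are simply never consumed.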

Method 2:

  • layer2_requires_grad=False
  • optim = {all layers with requires_grad = True}
  • compute loss
  • loss.backward()
  • optim.step()
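Method 2 might look like this, with the same illustrative toy network:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 1))

# Mark every parameter of layer2 as frozen...
for p in net[1].parameters():
    p.requires_grad = False

# ...and hand the optimizer only the parameters that still require grad.
optimizer = torch.optim.SGD(
    (p for p in net.parameters() if p.requires_grad), lr=0.1)

loss = net(torch.randn(8, 4)).sum()
loss.backward()   # no gradient is even computed for layer2
optimizer.step()

assert net[1].weight.grad is None
```

Unlike Method 1, autograd skips the gradient computation for layer2 entirely, which is where the savings come from.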

Method 3:

  • optim = {layer1, layer2, layer3}
  • layer2_old_weights = layer2.weight (this saves layer2 weights to a variable)
  • compute loss
  • loss.backward()
  • optim.step()
  • layer2.weight = layer2_old_weights (this sets layer2 weights to old weights)
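A sketch of Method 3 follows. One pitfall worth flagging: the saved weights must be a `clone()`, because a plain assignment like `layer2_old_weights = layer2.weight` only stores a reference that would silently track the in-place update made by `step()`:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Save a *copy* of layer2's parameters, not a reference.
old_w = net[1].weight.detach().clone()
old_b = net[1].bias.detach().clone()

loss = net(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()  # this step updates layer2 too...

# ...so copy the saved values back afterwards
# (in-place copy on a leaf tensor must happen under no_grad).
with torch.no_grad():
    net[1].weight.copy_(old_w)
    net[1].bias.copy_(old_b)

assert torch.equal(net[1].weight, old_w)
```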

Method 4:

  • optim = {layer1, layer2, layer3}
  • compute loss
  • loss.backward()
  • set layer2 gradients to 0
  • optim.step()
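Method 4 can be sketched as below. A caveat (my own observation, worth verifying for your optimizer): a zeroed gradient only guarantees no update for plain SGD; with momentum, weight decay, or Adam's running averages, the parameter can still move even when its gradient is zero.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 1))
# Plain SGD: no momentum, no weight decay, so a zero gradient means no update.
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

loss = net(torch.randn(8, 4)).sum()
loss.backward()

before = net[1].weight.clone()
# Zero layer2's gradients before the step.
for p in net[1].parameters():
    p.grad.zero_()
optimizer.step()

assert torch.equal(net[1].weight, before)
```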

My questions:

  1. Should we get different results for each method?
  2. Is any of these methods wrong?
  3. Is there a preferred method?

I would like to do it the following way -

# we want to freeze the fc2 layer this time: only train fc1 and fc3
net.fc2.weight.requires_grad = False
net.fc2.bias.requires_grad = False

# passing only those parameters that explicitly require grad
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

# then do the normal execution of loss calculation and backward propagation

#  unfreezing the fc2 layer for extra tuning if needed
net.fc2.weight.requires_grad = True
net.fc2.bias.requires_grad = True

# add the unfrozen fc2 weight to the current optimizer
optimizer.add_param_group({'params': net.fc2.parameters()})

@kelam_goutam I believe your way is the same as Method 2 described above. Can you please explain why you prefer it over the others?

I feel Methods 3 and 4 waste computation: why compute gradients for layers you don't want to update? I think Method 1 would be ideal, since with Method 2 we have to explicitly mark the layer's parameters with requires_grad = False, and it then becomes our responsibility to mark them True again if we later want to update those layers. Even so, I preferred Method 2, thinking that for a huge network it is easier to freeze the weights of any layer this way, as the optimizer automatically gathers which layers to update.

warm regards,
Goutam Kelam


If you can change the contents of a layer's forward method, you can use self.eval() and with torch.no_grad():

if self.frozen_pretrained_weights:
    with torch.no_grad():
        output = self.encoders(batch)
else:
    output = self.encoders(batch)
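This forward-side approach can be rounded out into a runnable module. The names `encoders` and `frozen_pretrained_weights` follow the post; the layer shapes and the `head` layer are my own toy additions:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, frozen_pretrained_weights=True):
        super().__init__()
        self.frozen_pretrained_weights = frozen_pretrained_weights
        self.encoders = nn.Linear(4, 4)  # stand-in for a pretrained encoder
        self.head = nn.Linear(4, 1)

    def forward(self, batch):
        if self.frozen_pretrained_weights:
            self.encoders.eval()       # also disables Dropout/BatchNorm updates
            with torch.no_grad():      # no graph is built for the encoder
                output = self.encoders(batch)
        else:
            output = self.encoders(batch)
        return self.head(output)

m = Model()
out = m(torch.randn(8, 4))
out.sum().backward()

assert m.encoders.weight.grad is None      # encoder received no gradient
assert m.head.weight.grad is not None      # head still trains normally
```

Because `torch.no_grad()` detaches the encoder output, backpropagation stops at the head; `eval()` is what additionally freezes things like BatchNorm running statistics that `requires_grad` alone does not cover.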