I am a bit confused about the correct way to freeze layers.
Suppose I have the following NN: layer1, layer2, layer3
I want to freeze the weights of layer2, and only update layer1 and layer3.
Based on other threads, I am aware of the following ways of achieving this goal.
Method 1:
- optim = {layer1, layer3}
- compute loss
- loss.backward()
- optim.step()
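A minimal sketch of Method 1, assuming a toy three-layer model (all names and shapes here are illustrative):

import torch
import torch.nn as nn
import torch.optim as optim

# toy model: three linear layers standing in for layer1, layer2, layer3
model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 20), nn.Linear(20, 2))
layer1, layer2, layer3 = model

# Method 1: hand the optimizer only the parameters of layer1 and layer3
optimizer = optim.SGD(list(layer1.parameters()) + list(layer3.parameters()), lr=0.1)

x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()   # gradients are still computed for layer2 here
optimizer.step()  # but the optimizer never touches layer2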
Method 2:
- set requires_grad = False for layer2's parameters
- optim = {all layers with requires_grad = True}
- compute loss
- loss.backward()
- optim.step()
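A minimal sketch of Method 2 with the same kind of toy model (this is essentially what the reply below does with filter):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 20), nn.Linear(20, 2))
layer1, layer2, layer3 = model

# Method 2: turn off gradient tracking for layer2, then build the optimizer
# from the parameters that still require gradients
for p in layer2.parameters():
    p.requires_grad_(False)
optimizer = optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()   # no gradients are computed for layer2 at all
optimizer.step()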
Method 3:
- optim = {layer1, layer2, layer3}
- layer2_old_weights = layer2.weight (this saves layer2 weights to a variable)
- compute loss
- loss.backward()
- optim.step()
- layer2.weight = layer2_old_weights (this sets layer2 weights to old weights)
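A minimal sketch of Method 3, again with an illustrative toy model; note that a stateful optimizer (momentum, Adam) still updates its internal statistics for layer2 even though its weights are put back:

import copy
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 20), nn.Linear(20, 2))
layer1, layer2, layer3 = model
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Method 3: remember layer2's weights, let the optimizer update everything,
# then overwrite layer2 with the saved copy
layer2_old_weights = copy.deepcopy(layer2.state_dict())

x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()
layer2.load_state_dict(layer2_old_weights)  # undo the update to layer2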
Method 4:
- optim = {layer1, layer2, layer3}
- compute loss
- loss.backward()
- set layer2 gradients to 0
- optim.step()
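A minimal sketch of Method 4 with the same kind of toy model; with plain SGD a zero gradient means no update, but with weight decay or an optimizer such as Adam the parameter can still move:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 20), nn.Linear(20, 2))
layer1, layer2, layer3 = model
optimizer = optim.SGD(model.parameters(), lr=0.1)

x, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

# Method 4: zero out layer2's gradients before the optimizer step
for p in layer2.parameters():
    if p.grad is not None:
        p.grad.zero_()
optimizer.step()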
My questions:
- Should we get different results for each method?
- Is any of these methods wrong?
- Is there a preferred method?
I would do it the following way:

import torch.optim as optim

# we want to freeze the fc2 layer this time: only train fc1 and fc3
net.fc2.weight.requires_grad = False
net.fc2.bias.requires_grad = False

# pass only those parameters that still require gradients to the optimizer
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

# then do the normal loss computation and backward pass

# unfreeze the fc2 layer later if it needs extra tuning
net.fc2.weight.requires_grad = True
net.fc2.bias.requires_grad = True

# add the now-unfrozen fc2 parameters to the existing optimizer
optimizer.add_param_group({'params': net.fc2.parameters()})
@kelam_goutam I believe your way is the same as Method 2 described above. Can you please explain why you prefer this over others?
I feel Methods 3 and 4 are a waste of computation. Why compute gradients for layers you don't want to update? I think Method 1 would be ideal, since with Method 2 we need to explicitly mark the layer's parameters with requires_grad = False and it is then our responsibility to mark them as True again if we want to update those layers later. However, I preferred Method 2, thinking that it makes it easier to freeze the weights of any layer in a huge network, since the optimizer automatically gathers which layers to update.
Warm regards,
Goutam Kelam
If you can change the contents of a layer's forward method, you can use self.eval() and with torch.no_grad():
if self.frozen_pretrained_weights:
    self.eval()
    with torch.no_grad():
        output = self.encoders(batch)
else:
    output = self.encoders(batch)
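A self-contained sketch of that pattern, with hypothetical module and attribute names (frozen_pretrained_weights, encoders are placeholders for whatever your model uses):

import torch
import torch.nn as nn

class EncoderWrapper(nn.Module):
    # hypothetical wrapper: 'encoders' stands in for a pretrained sub-network
    def __init__(self, frozen_pretrained_weights=True):
        super().__init__()
        self.frozen_pretrained_weights = frozen_pretrained_weights
        self.encoders = nn.Sequential(nn.Linear(10, 10), nn.ReLU())

    def forward(self, batch):
        if self.frozen_pretrained_weights:
            self.eval()                # dropout / batch norm behave as at inference
            with torch.no_grad():      # autograd does not record the encoder pass
                output = self.encoders(batch)
        else:
            output = self.encoders(batch)
        return output

out = EncoderWrapper()(torch.randn(4, 10))  # out.requires_grad is False when frozen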