Best practice for freezing layers?

There are many posts asking how to freeze layers, but different authors take somewhat different approaches. Most of the time I saw something like this:

Imagine we have a nn.Sequential and only want to train the last layer:

import torch.optim as optim

# Freeze every parameter, then re-enable only the last layer.
for parameter in model.parameters():
    parameter.requires_grad = False
for parameter in model[-1].parameters():
    parameter.requires_grad = True
optimizer = optim.SGD(model.parameters(), lr=1e0)

I think it is much cleaner to pass only the last layer's parameters to the optimizer:

optimizer = optim.SGD(model[-1].parameters(), lr=1e0)

I ran some simple tests and both methods yield the same results.
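A check along these lines could look like the sketch below. It is only an illustration: the toy model, the random data, the MSE loss, and the names `model_a`, `model_b`, and `same_weights` are assumptions and not the tests that were actually run.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy model and data; the sizes are arbitrary.
model_a = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model_b = copy.deepcopy(model_a)  # identical starting weights
x, y = torch.randn(16, 4), torch.randn(16, 2)

# Method 1: freeze everything except the last layer.
for p in model_a.parameters():
    p.requires_grad = False
for p in model_a[-1].parameters():
    p.requires_grad = True
opt_a = optim.SGD(model_a.parameters(), lr=0.1)

# Method 2: hand only the last layer to the optimizer.
opt_b = optim.SGD(model_b[-1].parameters(), lr=0.1)

# Train both copies for a few steps on the same batch.
for model, opt in ((model_a, opt_a), (model_b, opt_b)):
    for _ in range(5):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

# Compare every parameter of the two models after training.
same_weights = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same_weights)
```

This only compares the resulting weights. It says nothing about memory use, compute, or what is left in the `.grad` attributes of the untouched layers.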

  1. Is my method functionally equivalent or did I miss something important?
  2. Is the first method an agreed-upon best practice, or can I use my method?

Hi, consider that if you pass only the desired parameters to the optimizer and nothing else, you are only updating those parameters, which is indeed equivalent to freezing the other layers. However, you aren't zeroing the gradients of the other layers but accumulating them, as they aren't affected by optimizer.zero_grad().

You can follow your approach if you zero those gradients yourself (either manually or using a second optimizer). Otherwise, the first approach is the proper one.
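To make the accumulation concrete, here is a minimal sketch. The toy model, the batch, and the variable names are assumptions for illustration only. It also uses model.zero_grad() as one way to clear every parameter manually.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
# Only the last layer is handed to the optimizer.
optimizer = optim.SGD(model[-1].parameters(), lr=0.1)

x = torch.randn(16, 4)
first_layer_grads = []
for _ in range(2):
    optimizer.zero_grad()  # clears only the gradients of model[-1]
    model(x).sum().backward()
    optimizer.step()
    first_layer_grads.append(model[0].weight.grad.clone())

# The first layer's .grad was never cleared, so the second snapshot holds the
# sum of both backward passes instead of the second pass alone.
accumulated = not torch.equal(first_layer_grads[0], first_layer_grads[1])

# model.zero_grad() clears all parameters, whether or not the optimizer
# knows about them. Depending on the PyTorch version, it either sets .grad
# to None or fills it with zeros, so both cases are handled here.
model.zero_grad()
g = model[0].weight.grad
cleared = g is None or not bool(g.any())
print(accumulated, cleared)
```

Calling model.zero_grad() at the start of each iteration instead of optimizer.zero_grad() avoids the accumulation without needing a second optimizer.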


So with my method I have a significant overhead, since the gradients are calculated anyway, whereas the first approach only calculates the gradients for the layer I want to train?

Both calculate gradients.
In the 1st case the weights aren't updated because the optimizer skips parameters that have no gradient, which is the case once requires_grad = False. They are still covered by optimizer.zero_grad(), as they have been passed to the optimizer.

In the 2nd case the weights aren't updated because they haven't been passed to the optimizer.
Here, the gradients aren't zeroed either, for the same reason. This is OK as long as none of those parameters is ever meant to be trained; otherwise they will be affected by this gradient accumulation.

Note that even if you set requires_grad = False on a layer, gradients may still have to flow through it to reach earlier trainable layers. That is why backward still passes through frozen layers when something before them is trainable.
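A small sketch of that situation, assuming a hypothetical three-layer model in which only the middle layer is frozen:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model; only the middle Linear layer is frozen.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 2))
for p in model[1].parameters():
    p.requires_grad = False

model(torch.randn(16, 4)).sum().backward()

# The frozen layer stores no gradients for its own parameters...
frozen_has_grad = model[1].weight.grad is not None
# ...but backward still passed through it to reach the first layer.
first_has_grad = model[0].weight.grad is not None
print(frozen_has_grad, first_has_grad)
```

So freezing a layer skips the gradients with respect to its own weights, while the gradient with respect to its input is still computed whenever an earlier layer needs it.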


In other words, while the parameters I didn't pass to the optimizer are indeed static, the gradient update of the trainable parameters will be wrong after the initial step, since the other gradients have never been cleared?

The passed parameters will get the right gradients and will be updated properly. However, as @JuanFMontesinos described, the non-frozen parameters that were not passed to the optimizer will accumulate gradients (without an update in your use case). Should you change the optimizer and suddenly want to update all parameters, this update might be wrong. Note that this is an edge case.

I would recommend freezing the unwanted parameters to potentially save memory and computation. If you only need to update the last layer, the gradients of all previous layers won't be calculated if they are not needed.
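As a rough sketch of this recommendation (the toy model and the names `trainable`, `hidden`, and `frozen_grads` are assumptions for illustration), you can freeze the earlier layers and hand only the remaining trainable parameters to the optimizer:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze everything except the last layer.
for p in model[:-1].parameters():
    p.requires_grad = False

# Pass only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)

x = torch.randn(16, 4)
# Neither the input nor the frozen parameters require gradients, so no
# graph is recorded for this part of the forward pass.
hidden = model[:-1](x)
hidden_tracked = hidden.requires_grad

loss = model[-1](hidden).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The frozen layers never receive a gradient.
frozen_grads = [p.grad for p in model[:-1].parameters()]
print(hidden_tracked, frozen_grads)
```

Splitting the forward pass into `model[:-1]` and `model[-1]` is done here only to inspect the intermediate activation. Calling `model(x)` directly behaves the same way.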


How would you use this approach if you wanted to stop the whole model from updating, which I assume would effectively stop that model from being trainable?

If you don't want to update any parameters in the model and thus don't want to create the computation graph, you can wrap the forward pass in a with torch.no_grad() block. Calling backward() on the resulting loss would then raise an error.
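A minimal sketch of this, assuming a hypothetical toy model and a dummy loss:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(16, 4)

# No computation graph is recorded inside this block.
with torch.no_grad():
    loss = model(x).sum()

tracked = loss.requires_grad

# backward() has no graph to traverse, so it raises a RuntimeError.
try:
    loss.backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True
print(tracked, backward_failed)
```

The parameters keep requires_grad = True here. Only the recording of operations is disabled for the duration of the block.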

Hi Ptrblck,

That's what I thought. Thanks for this.

Based on my understanding, if all leaf nodes (which are usually parameters and inputs) have requires_grad=False, then all intermediate nodes will have requires_grad=False and are therefore considered leaves as well. So no gradient is computed.

Furthermore, even if some nodes have requires_grad = True, gradients are computed only for the intermediate nodes that are connected to the nodes with requires_grad = True.

So, in many models (at least sequential ones) no gradient is computed if requires_grad=False for weights and inputs.
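Here is a small sketch with plain tensors that can be used to check this understanding. The tensor names are made up for illustration.

```python
import torch

torch.manual_seed(0)

a = torch.randn(3)  # stands in for an input, requires_grad=False
w = torch.randn(3)  # stands in for a frozen weight, requires_grad=False

# No leaf requires gradients, so the result does not either and carries
# no grad_fn. Such tensors are treated as leaves by convention.
out = (a * w).sum()
frozen_info = (out.requires_grad, out.is_leaf, out.grad_fn is None)

# As soon as one leaf requires gradients, the result is tracked.
w2 = torch.randn(3, requires_grad=True)
out2 = (a * w2).sum()
tracked_info = (out2.requires_grad, out2.is_leaf, out2.grad_fn is None)

out2.backward()
# Only the leaf that requires gradients receives one.
grads = (a.grad, w2.grad)
print(frozen_info, tracked_info, grads)
```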

My understanding comes from:

Please let me know if I misunderstood something.