Best practice for freezing layers?

There are many posts asking how to freeze layers, but the different authors take somewhat different approaches. Most of the time I saw something like this:

Imagine we have a nn.Sequential and only want to train the last layer:

for parameter in model.parameters():
    parameter.requires_grad = False
for parameter in model[-1].parameters():
    parameter.requires_grad = True
optimizer = optim.SGD(model.parameters(), lr=1e0)

I think it is much cleaner to solve this like this:

optimizer = optim.SGD(model[-1].parameters(), lr=1e0)

I ran some simple tests and both methods yield the same results.


  1. Is my method functionally equivalent or did I miss something important?
  2. Is the first method an agreed upon best practice or can I use my method?

Hi, consider that if you pass only the desired parameters to the optimizer and nothing else, you are only updating those parameters, which is indeed equivalent to freezing the other layers. However, you aren’t zeroing the gradients of the other layers; they keep accumulating, because optimizer.zero_grad() doesn’t affect them.

You can follow your approach if you zero those gradients (either manually or using a second optimizer). Otherwise, the first approach is the proper one.
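To make the accumulation concrete, here is a minimal sketch. The toy model, input, and loss are made up for illustration; only the last layer is passed to the optimizer, as in the second approach from the question.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Hypothetical toy model; only the last layer is handed to the optimizer.
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
optimizer = optim.SGD(model[-1].parameters(), lr=0.1)
x = torch.randn(8, 4)
first_weight_before = model[0].weight.detach().clone()

# Step 1: a normal training step.
optimizer.zero_grad()
model(x).pow(2).mean().backward()
optimizer.step()
g1 = model[0].weight.grad.clone()  # gradient left over on the first layer

# Step 2: zero_grad() only clears the parameters the optimizer knows about.
optimizer.zero_grad()
loss = model(x).pow(2).mean()
# Gradient of this step alone, computed without touching .grad.
(g2,) = torch.autograd.grad(loss, model[0].weight, retain_graph=True)
loss.backward()
accumulated = model[0].weight.grad

# The stored gradient is the sum of both steps, not the fresh one.
print(torch.allclose(accumulated, g1 + g2))
# The first layer itself was never updated.
print(torch.equal(model[0].weight, first_weight_before))
```

Calling model.zero_grad() instead of optimizer.zero_grad() clears the gradients of all parameters in the model, which is one way to do the manual zeroing mentioned above.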


So my method has a significant overhead, since the gradients are calculated anyway, whereas the first approach only calculates the gradients for the layer I want to train?

Both calculate gradients.
In the 1st case the weights aren’t updated because the optimizer skips parameters that hold no gradient, which is the case for parameters with requires_grad = False. Any gradients they do hold are still zeroed, as they have been passed to the optimizer.

In the 2nd case weights aren’t updated because they haven’t been passed to the optimizer.
Here, gradients aren’t zeroed as they haven’t been passed to the optimizer. This is ok if you don’t have previous trainable parameters, otherwise they will be affected by this gradient accumulation.

Realize that, even if you set requires_grad = False, you may need those gradients to backprop through previous trainable layers. That is why gradients are computed in both cases.
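A small sketch of that last point, using a made-up three-layer model in which only the middle layer is frozen. The gradient still has to flow through the frozen layer to reach the first one, but nothing is stored for the frozen parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical model: trainable -> frozen -> trainable.
model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 3), nn.Linear(3, 1))

# Freeze only the middle layer.
for parameter in model[1].parameters():
    parameter.requires_grad = False

model(torch.randn(8, 4)).pow(2).mean().backward()

# Backprop passed through the frozen layer to reach the first layer,
# but the frozen layer's own parameters received no gradient.
first_has_grad = model[0].weight.grad is not None
middle_has_grad = model[1].weight.grad is not None
last_has_grad = model[2].weight.grad is not None
print(first_has_grad, middle_has_grad, last_has_grad)
```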


In other words, while the parameters I didn’t pass to the optimizer are indeed static, the gradient update of the trainable parameters will be wrong after the initial step, since the gradients have never been cleared?

The passed parameters will get the right gradients and will be updated properly. However, as @JuanFMontesinos described, the parameters that were not passed to the optimizer still require gradients and will accumulate them (without an update in your use case). Should you change the optimizer and suddenly want to update all parameters, this update might be wrong. Note that this is an edge case.

I would recommend freezing the unwanted parameters to potentially save memory and computation. If you only need to update the last layer, the gradients of all previous layers won’t be calculated if they are not needed.
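As a rough illustration of the recommended approach, the sketch below freezes everything except the last layer of a made-up toy model and runs one SGD step. The learning rate and the loss are arbitrary.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# Hypothetical toy model; only the last layer should be trained.
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for parameter in model.parameters():
    parameter.requires_grad = False
for parameter in model[-1].parameters():
    parameter.requires_grad = True

# Passing all parameters is fine: frozen ones never receive a gradient.
optimizer = optim.SGD(model.parameters(), lr=0.1)
frozen_before = model[0].weight.detach().clone()
head_before = model[-1].weight.detach().clone()

optimizer.zero_grad()
model(torch.randn(8, 4)).pow(2).mean().backward()
optimizer.step()

# No gradient was computed or stored for the frozen layer.
frozen_grad_is_none = model[0].weight.grad is None
frozen_unchanged = torch.equal(model[0].weight, frozen_before)
head_changed = not torch.equal(model[-1].weight, head_before)
print(frozen_grad_is_none, frozen_unchanged, head_changed)
```

Both approaches can also be combined by freezing the layers and passing only the trainable parameters, for example `optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.1)`.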


How would you use this approach if you wanted to stop the whole model from updating, which I assume would effectively stop that model from being trainable…

If you don’t want to update any parameters in the model and thus don’t want to create the computation graph, you can wrap the forward pass in a with torch.no_grad() block. Calling backward() on the resulting loss would then raise an error.
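A minimal sketch of this, with a made-up single-layer model standing in for the real one:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # hypothetical stand-in for the whole model
x = torch.randn(8, 4)

# No computation graph is recorded inside the block.
with torch.no_grad():
    loss = model(x).pow(2).mean()

loss_requires_grad = loss.requires_grad  # False: nothing to backprop through

try:
    loss.backward()
    backward_failed = False
except RuntimeError as error:
    # The loss has no grad_fn, so autograd refuses to run backward.
    backward_failed = True
    print(error)
```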

Hi Ptrblck,

That’s what I thought. Thanks for this.

Hi,
based on my understanding, if all leaf nodes (which are usually parameters and inputs) have requires_grad=False, then all intermediate nodes will have requires_grad=False as well and are therefore also considered leaves. So, no gradient is computed.

Furthermore, even if some nodes have requires_grad = True, gradients are computed only for the intermediate nodes that are connected to the nodes with requires_grad = True.

So, in many models (at least sequential ones) no gradient is computed if requires_grad=False for weights and inputs.
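This can be checked directly on the tensors. The sketch below uses two made-up layers: a frozen one fed with a plain input, followed by a trainable one.

```python
import torch
from torch import nn

frozen = nn.Linear(4, 3)  # hypothetical frozen layer
for parameter in frozen.parameters():
    parameter.requires_grad = False

x = torch.randn(8, 4)  # inputs do not require grad by default
hidden = frozen(x)

# With only non-differentiable leaves upstream, no graph is recorded.
hidden_info = (hidden.requires_grad, hidden.grad_fn is None, hidden.is_leaf)
print(hidden_info)

head = nn.Linear(3, 1)  # trainable layer on top
out = head(hidden)

# The graph starts at the first operation that involves a trainable leaf.
out_info = (out.requires_grad, out.grad_fn is not None)
print(out_info)
```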

My understanding comes from:

https://pytorch.org/docs/stable/notes/autograd.html

Please let me know if I misunderstood something.

Thanks

My results are confusing me. I have a straightforward autoencoder, which has been trained for the purpose of providing a trained encoder for other uses. I load the weights, etc., set requires_grad to False and use it, allowing the optimizer to see the frozen layers. It does not work well! However, when I remove the pretrained layers from the optimizer, it works as expected: fast, and with great results. Not sure what I am missing … should I ALSO use with torch.no_grad() on the forward() portion of the frozen layers?

No, using no_grad() for a forward pass which only uses frozen parameters will not change anything.
Did you freeze the parameters after they have been updated? In this case the parameters could still be updated by the optimizer even with a zero gradient if the optimizer tracks running stats of the previous updates (e.g. Adam).
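A small sketch of that Adam edge case, using a made-up single layer. The layer is updated once, frozen afterwards, and its gradient is left as a zero tensor via set_to_none=False (an assumption chosen to trigger the effect; with a gradient of None the parameter is skipped).

```python
import torch
from torch import nn, optim

layer = nn.Linear(2, 1)  # hypothetical layer, trained once and then frozen
optimizer = optim.Adam(layer.parameters(), lr=0.1)

# One regular update, so Adam stores non-zero running averages.
layer(torch.ones(4, 2)).sum().backward()
optimizer.step()

# Freeze afterwards and leave a zero gradient tensor behind.
for parameter in layer.parameters():
    parameter.requires_grad = False
optimizer.zero_grad(set_to_none=False)

before = layer.weight.detach().clone()
optimizer.step()
# The running average of past gradients still moves the weight.
moved_with_zero_grad = not torch.equal(layer.weight, before)

# With the gradient set to None, the optimizer skips the parameter.
optimizer.zero_grad(set_to_none=True)
before = layer.weight.detach().clone()
optimizer.step()
moved_with_none_grad = not torch.equal(layer.weight, before)

print(moved_with_zero_grad, moved_with_none_grad)
```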

I think I have solved this - crashed CUDA after ubuntu update. Thanks for quick responses …

seth

@ptrblck in answer to your question: no, params are frozen after being loaded in a new model

In that case the optimizer should not change them so could you post a minimal, executable code snippet which would reproduce the issue, please?

Sorry again if I was unclear. I am unable to reproduce it. The problem disappeared after I updated CUDA following an Ubuntu update. I had other issues which led me to NVIDIA driver update, and then a cuda refresh. Sorry for the wasted cycles. I was running ddp on 2 gpus and can only assume that something had crashed. If it reappears, I will post it

I stumbled upon a way to recreate this problem. As part of refactoring, I moved the loading of weights and freezing params into the latent module init(), and that does it. Snippet in the next few hours …

THIS IS A RED HERRING - DUE TO A BUG IN MY CODE. PLEASE IGNORE

I have built something that demonstrates a version of the problem. It is hardly a snippet - nearly 300 lines of python, plus a .pt file for the encoder weights. The symptoms are not as bad as my full size model, but the differences in training results are apparent. My ‘real’ version is ddp on 2 gpus using pytorch-lightning. The demonstration version is single gpu pytorch only.

It seems plain to me that this is not an optimizer issue. This looks like a weights initialization sequencing issue.

In all cases the pretrained weights are loaded before the optimizer (adam, in my case) is created or run.
Things work fine if the pretrained encoder weights are loaded at the end of the entire model constructor.
The problematic case is if the pretrained weights are loaded at the end of the encoder constructor, but before the entire model constructor is finished.

Should this be a problem? If the 2 cases should train the same, I will create a new issue, describe the problem again, and post my mega-snippet.
If the difference is expected, I’d appreciate a brief explanation so I can load my weights at the best moment.

Again, thanks …

seth

It’s a bit unclear what exactly is happening in your code now, and from your description it sounds as if you expect a specific order of random numbers and so depend on seeding and the state of the pseudorandom number generator. If so, you can easily run into issues where additional calls to the PRNG change the samples drawn afterwards.
I can’t speculate on the current issue as the root cause was changed a few times already, so feel free to post a minimal, executable code snippet showing the new issue.

I appreciate your willingness to take a look, particularly given my public thrashing around this bug in my code.

However, I don’t wanna waste your time. You are too valuable a resource to the community. And there is plainly no problem with the optimizer or weight loading issues.

The small results changes between sending the frozen parameters to the optimizer/not sending them are well within the expected random number generation issues, even with good manual seeding practices to force consistency. The deltas are all beyond the sixth decimal place.

The actual problem I was chasing - for the record - turned out to be a bad git check-in from a team member. The code was attempting to load the state dict with a .pt file saved from a different level of the model hierarchy, unfortunately with strict=False. Of course, it silently failed, so the encoder was left with its initial weights instead of the pretrained ones, and with requires_grad set to False. No wonder it didn’t train properly.
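For anyone hitting the same thing, here is a rough sketch of how this failure mode looks. The Model class and its layer sizes are made up, and an in-memory state dict stands in for the .pt file. The return value of load_state_dict lists the keys that did not match, so it is worth checking whenever strict=False is used.

```python
import torch
from torch import nn

class Model(nn.Module):
    # Hypothetical model with a pretrained encoder and a new head.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4, 2))
        self.head = nn.Linear(2, 1)

# State dict saved at the encoder level: keys are "0.weight" and "0.bias".
pretrained_encoder = nn.Sequential(nn.Linear(4, 2))
encoder_state = pretrained_encoder.state_dict()

model = Model()
before = model.encoder[0].weight.detach().clone()

# Loading at the wrong level: no key matches, yet no error is raised.
result = model.load_state_dict(encoder_state, strict=False)
print(result.missing_keys)
print(result.unexpected_keys)
silently_unchanged = torch.equal(model.encoder[0].weight, before)

# Loading at the matching level, with the default strict=True.
model.encoder.load_state_dict(encoder_state)
loaded = torch.equal(model.encoder[0].weight, pretrained_encoder[0].weight)
```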

Sorry for the public thrashing, and thanks for your persistent efforts to help.
