Freezing intermediate layers while training top and bottom layers

I am trying to understand how to get the “freeze” weights functionality to work. Suppose I have a multi-layer network: x --> L1 --> L2 --> L3 --> y (where x is the input and y is the output).
Suppose I want to freeze only the L2 layer in PyTorch, keeping L1 and L3 trainable.
As discussed in [1] and a bunch of other posts, I would simply set requires_grad=False for all params in L2, which disables gradient computation for the L2 params. In those posts, however, the layers close to the loss function are trainable and the layers far from it are frozen (i.e. all trainable params are connected directly to the loss in the gradient graph).

In my case, if I freeze L2, does L1 still get trained during backprop? If yes, how?

[1] How the pytorch freeze network in some layers, only the rest of the training?

Maybe, in my case, I should not set requires_grad=False on the L2 parameters; instead, I must exclude all L2 parameters from the optimizer. That way, the right gradients will flow back to L1’s params, but the optimizer does not update the L2 parameters (which is analogous to freezing L2 while keeping L1 trainable).

Is this a correct approach?

Sorry to not answer your question, but I’m struggling to reconcile this as well. I asked about it here and here

One simple thing you can try is to just not include the L2 layer in the optimizer, so the gradients will still be computed but the parameters will not be updated. Here I presume that you are not using batch norm or dropout.
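A minimal sketch of that suggestion, assuming a toy network of three nn.Linear layers standing in for L1, L2 and L3 (the layer sizes, learning rate and random data are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for x --> L1 --> L2 --> L3 --> y.
l1 = nn.Linear(4, 8)
l2 = nn.Linear(8, 8)
l3 = nn.Linear(8, 1)
model = nn.Sequential(l1, l2, l3)

# Hand only the L1 and L3 parameters to the optimizer; L2 is left out.
optimizer = torch.optim.SGD(
    list(l1.parameters()) + list(l3.parameters()), lr=0.1
)

# Snapshots to compare against after one update step.
l1_before = l1.weight.detach().clone()
l2_before = l2.weight.detach().clone()

x = torch.randn(16, 4)
target = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), target)

# Clear gradients on the whole model, including L2, which the optimizer does not know about.
model.zero_grad()
loss.backward()
optimizer.step()

l2_has_grad = l2.weight.grad is not None          # gradient was still computed for L2
l2_unchanged = torch.equal(l2.weight, l2_before)  # but L2 was not updated
l1_changed = not torch.equal(l1.weight, l1_before)
```

Note that optimizer.zero_grad() only clears the parameters the optimizer holds, so in this setup the L2 gradients would keep accumulating unless they are cleared some other way, e.g. with model.zero_grad() as above.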

See per parameter options

Thanks for the links! I will keep an eye on those threads as well.

Hey t.g., I reached out to a fellow named Elliot Waite on YouTube, and his video and his answer to my comment clarified things a lot for me.

In summary, think of setting requires_grad = False as a way of telling L2’s parameters not to care about gradients (no gradient is computed for them, so they never update). What I was missing is that it helps greatly to think of L2 as a bunch of input parameters (the ones you are freezing) connected to a bunch of computation nodes, some of which also take the output of L1 as an input. When you set requires_grad = False on the parameters, you are NOT freezing the gradients on the computation nodes, so the necessary gradients still get passed backward to L1. Elliot pointed out that you will get an error if you try to set requires_grad on the result of a computation.
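To make this concrete, here is a small sketch, assuming three nn.Linear layers as hypothetical stand-ins for L1, L2 and L3 and a made-up loss. It freezes only L2's parameters and then checks where gradients end up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layers standing in for L1, L2 and L3.
l1 = nn.Linear(4, 8)
l2 = nn.Linear(8, 8)
l3 = nn.Linear(8, 1)

# Freeze only the parameters (leaf tensors) of L2.
for p in l2.parameters():
    p.requires_grad_(False)

x = torch.randn(16, 4)
h1 = l1(x)
h2 = l2(h1)  # output of the frozen layer; still part of the autograd graph
y = l3(h2)

loss = y.pow(2).mean()  # made-up loss, just to have something to backpropagate
loss.backward()

# The computation inside L2 still tracks gradients because its input h1 does.
h2_tracks_grad = h2.requires_grad
# No gradient is stored for the frozen parameters.
l2_grad_is_none = l2.weight.grad is None and l2.bias.grad is None
# Gradients flowed through L2 back to L1.
l1_got_grad = l1.weight.grad is not None and l1.weight.grad.abs().sum().item() > 0
l3_got_grad = l3.weight.grad is not None
```

In this sketch the frozen parameters end up with .grad left as None, while L1 still receives a nonzero gradient through L2's computation.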


Hi Kale, thanks, I just caught up with your discussion in the YouTube comment thread, and I am glad to learn that there is a check built into PyTorch that raises an error if we try to set requires_grad=False on intermediate nodes.
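That check can be seen with a tiny sketch (the tensor names are made up for illustration). A leaf tensor plays the role of a parameter, and the result of a computation on it is a non-leaf tensor whose requires_grad flag cannot be switched off:

```python
import torch

w = torch.randn(3, requires_grad=True)  # leaf tensor, plays the role of a parameter
h = w * 2                               # non-leaf tensor produced by a computation

w_is_leaf = w.is_leaf
h_is_leaf = h.is_leaf

# Trying to turn off requires_grad on the computed tensor is rejected.
try:
    h.requires_grad = False
    raised = False
except RuntimeError:
    raised = True

# detach() returns a tensor cut off from the graph instead of modifying h.
h_detached = h.detach()
detached_requires_grad = h_detached.requires_grad
```

Detaching is shown only for contrast: in the L1 --> L2 --> L3 setting, detaching L2's output would also stop gradients from reaching L1, which is not what is wanted here.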

In summary:

  1. We do not (and cannot) set requires_grad=False on intermediate layers that we want to freeze, because their gradients are needed for calculating the gradients of other connected nodes (i.e. the definition of a leaf node in the grad network).
  2. But intermediate nodes that we want to freeze can be excluded from the optimizer.

So, excluding all L2 parameters from the optimizer is the right approach!

let me go and try this stuff.
Cheers and thanks for the help,
TG


I must exclude all L2 parameters from optimizer

That is sufficient. But from a memory/time standpoint, the recommended practice would be to set requires_grad=False on the L2 parameters as well. Remember, you’re only freezing the weights (parameters) of L2, not the intermediate computations in that layer. When backprop does its thing, the gradients of the loss function with respect to the weights (parameters) of L2 are never required for updating the weights of L1 or L3.
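A sketch of combining the two, assuming a hypothetical nn.Sequential of three linear layers where index 1 plays the role of L2 (sizes, learning rate and data are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical network: model[0] = L1, model[1] = L2, model[2] = L3.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 8), nn.Linear(8, 1))

# Step 1: stop autograd from computing gradients for L2's parameters.
for p in model[1].parameters():
    p.requires_grad_(False)

# Step 2: give the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

# Snapshot every parameter so we can see which ones move.
before = [p.detach().clone() for p in model.parameters()]

x = torch.randn(16, 4)
target = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), target)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Parameter order: L1 weight, L1 bias, L2 weight, L2 bias, L3 weight, L3 bias.
changed = [not torch.equal(p, b) for p, b in zip(model.parameters(), before)]
```

Filtering on p.requires_grad keeps the optimizer's parameter list in sync with whichever layers are frozen, so the two steps do not have to be maintained separately.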

I think you and I had the same misunderstanding of requires_grad=False. It helped me a lot to watch Elliot’s video and visualize the intermediate layer (i.e. L2) not as a lumped network, but as a connection of parameters being fed to computations.
