How do you freeze some layers of a network in PyTorch and train only the rest?

Thanks for the clarification!
As a follow-up to my point regarding the speed-up: I am not observing a speedup when I freeze the initial 80% of the network. I expected training to be faster, since it only has to update the remaining 20% of the parameters, and lighter, since it only has to store the information needed to run the backward pass for 20% of the network.
Is there a speedup expected in this scenario?

I’m afraid I have no definitive answer for this since I don’t know your exact model setup, but here are a few suggestions:

  1. Every tensor that feeds into the frozen part of the computational graph must also have requires_grad=False, so that the frozen subgraph is excluded by the autograd engine. If any such tensor still requires grad, autograd will run the backward pass through that part of the graph anyway.
  2. I would check which part of your model is the main speed bottleneck. The unfrozen part may contain the layers that dominate the compute, like in this example. Or the bottleneck may be unrelated to the model itself, e.g. data loading that overshadows the model's forward/backward time.
  3. Make sure you call torch.cuda.synchronize() when measuring speed, so the timing includes all queued GPU work (see the sketch after this list).
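
For point 3, here is a minimal timing sketch, assuming a CUDA device is available (the model and input shapes are just placeholders):

import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
x = torch.randn(64, 1024, device="cuda")

torch.cuda.synchronize()                     # wait for any pending GPU work
start = time.time()

out = model(x)
loss = out.sum()
loss.backward()

torch.cuda.synchronize()                     # make sure forward/backward finished
print(f"iteration took {time.time() - start:.4f} s")

Without the synchronize calls, CUDA kernels are launched asynchronously and time.time() can return before the GPU has actually finished the work.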

Have you solved this problem?

@lugiavn’s method should work.
As long as you need to compute d(B)/d(params of A), you have to backpropagate the gradient along the paths in B that lead back to A, so the requires_grad attribute has to stay set for the tensors flowing through B. Writing a custom backward function for B may be a more efficient way; since B's parameters are never updated, its backward can stay fixed.
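
As a rough sketch of that idea, assuming B contains a frozen linear layer (the class name, shapes, and weight below are made up for illustration): a custom autograd Function whose backward only propagates the gradient to its input and never produces a gradient for the frozen weight.

import torch

class FrozenLinearFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        # weight is treated as a constant; keep it only for the backward pass
        ctx.save_for_backward(weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        (weight,) = ctx.saved_tensors
        # Gradient flows back to the input (so A can still be trained),
        # but we return None for the weight: it is never updated.
        return grad_output @ weight, None

weight = torch.randn(5, 10)                   # frozen weight belonging to B
x = torch.randn(3, 10, requires_grad=True)    # activations coming from A
out = FrozenLinearFn.apply(x, weight)
out.sum().backward()                          # x.grad is filled, weight gets no grad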

I was just experimenting with several things, and I found that network B's weights should have requires_grad = False, but the tensor should always have requires_grad = True, and the optimizer should only optimize network A's parameters. It works fine for me. If you disable the grad, you need to enable it again before calculating the loss, because backpropagation needs the gradient history. I don't know if this is the exact solution, but it works fine for me and does exactly what I want.
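
For what it's worth, a minimal sketch of that setup, assuming A feeds into B (the layer sizes and learning rate are placeholders): B's parameters are frozen, only A's parameters go to the optimizer, and the gradient still flows through B back to A.

import torch

A = torch.nn.Linear(10, 10)
B = torch.nn.Linear(10, 1)

for param in B.parameters():
    param.requires_grad = False              # B's weights will never be updated

optimizer = torch.optim.SGD(A.parameters(), lr=0.1)   # optimize A only

x = torch.randn(4, 10)
y = torch.randn(4, 1)

out = B(A(x))                                # gradient flows through B into A
loss = torch.nn.functional.mse_loss(out, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

In this sketch the input x itself does not need requires_grad=True; the activations produced by A already require grad because A's parameters do.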

Thanks for your detailed explanation.

But this does not quite make sense to me.
Could you explain further what the difference is between a network weight's requires_grad and a tensor's requires_grad?

Suppose we have a part of network B that we want to freeze.

# Freeze every parameter of conv1 so it will not be updated
for param in networkB.conv1.parameters():
    param.requires_grad = False

For a tensor, we can set it when creating the tensor; you can see the details here.
x = torch.tensor([1.], requires_grad=True)
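
Putting the two together, a short sketch (the NetworkB class here is a made-up stand-in for whatever model holds conv1): freeze conv1's parameters and pass only the still-trainable parameters to the optimizer.

import torch
import torch.nn as nn

class NetworkB(nn.Module):                   # hypothetical model with a conv1 layer
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3)

    def forward(self, x):
        return self.conv2(self.conv1(x))

networkB = NetworkB()

for param in networkB.conv1.parameters():    # freeze conv1 as shown above
    param.requires_grad = False

optimizer = torch.optim.SGD(
    filter(lambda p: p.requires_grad, networkB.parameters()), lr=0.01
)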


So is it the same thing if we don't use the filter? I mean, the output will be the same, right?