How to train a part of a network

I am building a model that consists of a pretrained subnet followed by some new layers. The subnet has already been trained, and now I want to train the new layers while keeping the subnet's parameters fixed.
I would like to know how to exclude the subnet's parameters from model.parameters() when optimizing the network with Adam.
Any advice would be appreciated!

The transfer learning tutorial has a relevant section:

http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#convnet-as-fixed-feature-extractor

To summarize:

  1. Set requires_grad=False for all parameters you do not wish to optimize. This avoids computing gradients for them:

    for param in base_model.parameters():
        param.requires_grad = False
    
  2. Pass only the parameters of the sub-network you want to train to the optimizer, by calling .parameters() on it:

    optim.Adam(model.sub_network.parameters(), ...)
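
Putting both steps together, here is a minimal sketch; the attribute names base_model and head are just placeholders for your own modules:

import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.base_model = nn.Sequential(nn.Linear(10, 10), nn.ReLU())  # pretrained subnet
        self.head = nn.Linear(10, 2)                                   # new layers to train

    def forward(self, x):
        return self.head(self.base_model(x))

model = Net()

# 1. freeze the pretrained subnet
for param in model.base_model.parameters():
    param.requires_grad = False

# 2. optimize only the new layers
optimizer = optim.Adam(model.head.parameters(), lr=1e-3)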
    

If your new layers aren’t entirely contained in a single Module, you can collect parameters by using list concatenation:

parameters = []
parameters.extend(model.new_layer1.parameters())
parameters.extend(model.new_layer2.parameters())
optimizer = optim.Adam(parameters, ...)
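
If you prefer, you can also chain the parameter iterators directly; this is just an alternative sketch using itertools, reusing the layer names from the snippet above:

import itertools
import torch.optim as optim

params = itertools.chain(model.new_layer1.parameters(),
                         model.new_layer2.parameters())
optimizer = optim.Adam(params, lr=1e-3)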

Thank you very much, it works :grinning:

Hi, maybe it's a silly question, but if I have two sub-networks, say netA and netB, and the parameters of netA are passed to optim.Adam (optim.Adam(model.netA.parameters())) while those of netB are not, what happens to netB's parameters? Thanks!


Parameters are only updated when you call optimizer.step(). Since you don't have an optimizer for the parameters of netB, they won't change.
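
To convince yourself, here is a small sketch (the Model class and layer sizes are made up for illustration): only netA's weights change after optimizer.step(), even though netB also receives gradients.

import torch
import torch.nn as nn
import torch.optim as optim

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.netA = nn.Linear(4, 4)
        self.netB = nn.Linear(4, 4)

    def forward(self, x):
        return self.netB(self.netA(x))

model = Model()
optimizer = optim.Adam(model.netA.parameters(), lr=1e-1)

before = model.netB.weight.clone()
model(torch.randn(2, 4)).mean().backward()
optimizer.step()

# netB received gradients, but its weights were not updated
print(torch.equal(before, model.netB.weight))  # True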

Is it necessary to set .requires_grad = False? If we simply don't pass those layers' parameters to the optimizer, would that work? I understand we would free up memory by not computing gradients, but is it necessary?

While optimizer.step() wouldn't update these parameters (since you've never passed them to the optimizer), they would still accumulate gradients.
This is of course wasteful, as these gradients are not needed (and Autograd could potentially stop the backward pass before reaching these parameters). In addition, you would have to be careful if you plan to update these parameters in the future, e.g. by adding them via optimizer.add_param_group, since they would already contain (large) accumulated gradients.
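
As a sketch of that last point (the module names here are made up): a layer that is left out of the optimizer but keeps requires_grad=True still accumulates gradients, so you would want to clear them before handing its parameters over via add_param_group.

import torch
import torch.nn as nn
import torch.optim as optim

frozen = nn.Linear(2, 2)      # not passed to the optimizer (requires_grad still True)
trainable = nn.Linear(2, 2)

optimizer = optim.Adam(trainable.parameters(), lr=1e-3)

out = trainable(frozen(torch.randn(8, 2))).mean()
out.backward()
print(frozen.weight.grad)     # not None: gradients were accumulated anyway

# clear the stale gradients before starting to train these parameters
for p in frozen.parameters():
    p.grad = None
optimizer.add_param_group({'params': frozen.parameters()})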


Hi, may I ask: if I set requires_grad = False on some intermediate layer's weights, will the gradients still backpropagate through that layer to the earlier layers so they get proper updates?

Yes, this will work as seen here:

import torch
import torch.nn as nn

# setup
model = nn.Sequential(
    nn.Linear(1, 1),
    nn.Linear(1, 1),
    nn.Linear(1, 1)
)

# freeze middle layer
for param in model[1].parameters():
    param.requires_grad = False

# calculate gradients
model(torch.randn(1, 1)).backward()

# check gradients
for name, param in model.named_parameters():
    print(name, param.grad)

> 0.weight tensor([[-0.0005]])
  0.bias tensor([-0.0335])
  1.weight None
  1.bias None
  2.weight tensor([[-0.6302]])
  2.bias tensor([1.])