How to train a part of a network

alan_ayu · October 21, 2017, 1:17pm

I am building a model that consists of a subnet followed by some new layers. The subnet has already been trained, and now I want to train only the new layers while keeping the subnet's parameters fixed.
I want to know how to exclude the subnet's parameters from model.parameters() when optimizing the net with Adam.
Any advice would be appreciated !!!

colesbury · October 21, 2017, 4:15pm

The transfer learning tutorial has a relevant section:

http://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#convnet-as-fixed-feature-extractor

To summarize:

Set requires_grad=False for all parameters you do not wish to optimize. This avoids computing gradients for them:
```
for param in base_model.parameters():
    param.requires_grad = False
```

Call .parameters() only on the part of the network you want to train, and pass the result to the optimizer:

```
optim.Adam(model.sub_network.parameters(), ...)
```

If your new layers aren’t entirely contained in a single Module, you can collect parameters by using list concatenation:

```
parameters = []
parameters.extend(model.new_layer1.parameters())
parameters.extend(model.new_layer2.parameters())
optimizer = optim.Adam(parameters, ...)
```
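
As a rough sketch of the steps above, another common way to collect the trainable parameters is to filter on requires_grad after freezing. The layer sizes, the names base_model and new_head, and the learning rate are made up for illustration and do not come from the original question:

```python
import torch
from torch import nn, optim

# Hypothetical model: a "pretrained" base followed by a new head.
base_model = nn.Linear(4, 8)
new_head = nn.Linear(8, 2)
model = nn.Sequential(base_model, nn.ReLU(), new_head)

# Freeze the base so no gradients are computed for it.
for param in base_model.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-3)  # lr is an arbitrary choice

num_trainable = sum(p.numel() for p in trainable)
print(num_trainable)  # 8*2 weights + 2 biases = 18
```

This avoids naming every new layer by hand, at the cost of relying on requires_grad being set correctly before the optimizer is created.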

alan_ayu · October 21, 2017, 4:51pm

Thank you very much, it works

roger_p · July 20, 2018, 10:31am

Hi, maybe it’s silly, but say I have two sub-networks, netA and netB. The parameters of netA are passed to the optimizer (optim.Adam(model.netA.parameters())) while those of netB are not. What would happen to netB’s parameters? Thanks!

kaixin · September 13, 2018, 1:55pm

The parameters are updated when you call optimizer.step(). Since you don’t have an optimizer for the parameters of netB, I think they won’t change.
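
A quick way to check this is a small experiment like the one below. It is only a sketch: netA and netB are stand-in nn.Linear layers, and the shapes, seed, and learning rate are arbitrary assumptions.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Stand-ins for the two sub-networks (sizes are arbitrary).
netA = nn.Linear(3, 3)
netB = nn.Linear(3, 1)

# Only netA's parameters are given to the optimizer.
optimizer = optim.Adam(netA.parameters(), lr=0.1)

# Keep copies of the weights to compare after the update.
a_before = netA.weight.detach().clone()
b_before = netB.weight.detach().clone()

loss = netB(netA(torch.randn(5, 3))).sum()
loss.backward()
optimizer.step()

a_changed = not torch.equal(netA.weight, a_before)
b_changed = not torch.equal(netB.weight, b_before)
print("netA changed:", a_changed)
print("netB changed:", b_changed)
```

Note that netB still receives gradients in its .grad attributes here, because requires_grad was left at its default of True. The optimizer just never applies them.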

Rakshit_Kothari · June 16, 2020, 2:05pm

Is it necessary to set .requires_grad = False? If we simply don't provide those layers to the optimizer, would it work? I understand we would free up memory by not computing the gradients, but is it necessary?

ptrblck · June 17, 2020, 6:58am

While optimizer.step() wouldn’t update these parameters (since you’ve never passed them to the optimizer), these parameters would still accumulate the gradients.
This is of course wasteful, as these gradients are not needed (and Autograd could potentially stop the backward pass before reaching these parameters). Additionally, you would have to be careful if you are planning to update these parameters in the future, e.g. by adding them via optimizer.add_param_group, since their .grad attributes would already contain (potentially large) accumulated gradients.
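
The accumulation can be seen in a small sketch like the following. The two nn.Linear layers, the input, and the learning rate are illustrative assumptions. Resetting .grad to None before calling add_param_group is one possible way to discard the stale gradients, not the only one:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

netA = nn.Linear(2, 2)
netB = nn.Linear(2, 1)  # never frozen, never passed to the optimizer
optimizer = optim.Adam(netA.parameters(), lr=0.1)

x = torch.ones(1, 2)
for _ in range(3):
    optimizer.zero_grad()  # only clears the gradients of netA's parameters
    netB(netA(x)).sum().backward()
    optimizer.step()

# netB.bias gets a gradient of 1.0 per backward call and is never reset.
accumulated = netB.bias.grad.clone()
print(accumulated)  # tensor([3.])

# Discard the stale gradients before handing netB to the optimizer.
for param in netB.parameters():
    param.grad = None
optimizer.add_param_group({"params": list(netB.parameters())})
```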

qcyza · March 4, 2021, 8:45pm

Hi, may I ask: if I set requires_grad = False on an intermediate layer’s weights, will the gradients still backpropagate through that layer to the earlier layers so that they get proper updates?

ptrblck · March 5, 2021, 5:42am

Yes, this will work as seen here:

```
import torch
import torch.nn as nn

# setup
model = nn.Sequential(
    nn.Linear(1, 1),
    nn.Linear(1, 1),
    nn.Linear(1, 1)
)

# freeze middle layer
for param in model[1].parameters():
    param.requires_grad = False

# calculate gradients
model(torch.randn(1, 1)).backward()

# check gradients
for name, param in model.named_parameters():
    print(name, param.grad)
```

```
0.weight tensor([[-0.0005]])
0.bias tensor([-0.0335])
1.weight None
1.bias None
2.weight tensor([[-0.6302]])
2.bias tensor([1.])
```