How do I freeze some layers of a network in PyTorch and train only the rest?

earnp · September 6, 2017, 2:24am

L0SG · September 6, 2017, 2:34am

Each parameter of the model has a requires_grad flag:
http://pytorch.org/docs/master/notes/autograd.html

For the resnet example in the doc, this loop will freeze all layers:

for param in model.parameters():
    param.requires_grad = False

To unfreeze only some of the last layers, identify the parameters you want to train inside this loop (or in a second loop) and set their flag back to True. That will suffice.
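A minimal sketch of that idea, using a small stand-in network instead of the resnet from the docs. TinyNet and its layer names are made up for illustration; the only PyTorch pieces relied on are named_parameters() and the requires_grad flag.

```python
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16))
        self.fc = nn.Linear(16, 2)

    def forward(self, x):
        return self.fc(self.features(x))


model = TinyNet()

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Then unfreeze only the parameters whose name marks them as part of the head.
for name, param in model.named_parameters():
    if name.startswith("fc."):
        param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Selecting by name prefix is just one option; for a real model you would print the names from named_parameters() first and pick the prefix that matches the layers you want to train.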

SpandanMadan · September 6, 2017, 3:43am

I faced this just a few days ago, so I’m sure this code should be up to date. Here’s my answer for Resnet, but this answer can be used for literally any model.

The basic idea is that all models have a function model.children() which returns its layers. Within each layer, there are parameters (or weights), which can be obtained using .parameters() on any child (i.e. layer). Now, every parameter has an attribute called requires_grad which is by default True. True means gradients will be computed for it during backpropagation, so to freeze a layer you need to set requires_grad to False for all parameters of that layer. This can be done like this:

import torchvision.models as models

model_ft = models.resnet50(pretrained=True)
ct = 0
for child in model_ft.children():
    ct += 1
    # freeze the first 6 top-level children
    if ct < 7:
        for param in child.parameters():
            param.requires_grad = False

This freezes children 1-6 of the 10 top-level children of Resnet50 (conv1 through layer2). Hope this helps!
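To check that a counter-based loop froze what you expected, it can help to count trainable versus total parameters. This is only a sketch: count_parameters is a hypothetical helper, and the four-layer nn.Sequential is a stand-in so the example runs without downloading ResNet weights.

```python
import torch.nn as nn


def count_parameters(model):
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


# Hypothetical four-block stand-in; a real ResNet would come from torchvision.
model = nn.Sequential(
    nn.Linear(4, 4),
    nn.Linear(4, 4),
    nn.Linear(4, 4),
    nn.Linear(4, 2),
)

# Freeze the first two top-level children, in the same spirit as the counter loop.
for index, child in enumerate(model.children()):
    if index < 2:
        for param in child.parameters():
            param.requires_grad = False

trainable, total = count_parameters(model)
print(f"{trainable} of {total} parameters are trainable")
```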

James_Chen · October 30, 2017, 7:47am

I am wondering whether to also call .eval() on those frozen layers, since BatchNorm may still update its running mean and running var.

SpandanMadan · October 30, 2017, 9:47am

Try this: reduce your learning rate drastically. Try viewing your gradients wrt the input and see if there’s any place where they are blowing up (going to inf). If yes, see why this happens.

varghese_alex · November 25, 2017, 5:22am

Hi Spandan;

I tried to replicate your code on Resnet 18 and more or less completed it. My aim was to freeze all layers in the network except the classification layer and the layer/block preceding it. Could you please let me know whether this is right?

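import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

model = torchvision.models.resnet18(pretrained=True)

lt = 8
cntr = 0

for child in model.children():
    cntr += 1
    # freeze the first lt - 1 children (everything before layer4)
    if cntr < lt:
        print(child)
        for param in child.parameters():
            param.requires_grad = False

# replace the classifier; the new layer's parameters require grad by default
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2)

criterion = nn.CrossEntropyLoss()

# pass only the trainable parameters to the optimizer
optimizer_ft = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001, momentum=0.9)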

vmirly1 · February 2, 2018, 4:43pm

Setting .requires_grad = False should work for convolution and FC layers. But how about networks that have instanceNormalization? Is setting .requires_grad = False enough for normalization layers too?

Brando_Miranda · April 9, 2018, 8:17pm

Does this work even if the network has already been trained? Say I have loaded a net pre-trained on X and I want to freeze layer Y (say the 2nd layer, to make the example concrete). How do I do that exactly?

L0SG · April 10, 2018, 4:20am

This snippet may clarify how to do it.

Set requires_grad to False for the parameters you want to freeze:

# we want to freeze the fc2 layer
net.fc2.weight.requires_grad = False
net.fc2.bias.requires_grad = False

Then set the optimizer like the following:

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

Alternatively, you can only add the parameters you want to train to the optimizer:
https://discuss.pytorch.org/t/to-require-grad-or-to-not-require-grad/5726

But I think the above method is more straightforward.

milani · April 11, 2018, 9:28am

@L0SG What if we want to unfreeze those layers later in the code? For example, I want to fine tune heads first, then tune the other layers as well. We will need to instantiate a new optimizer, right? If so, does it affect the optimization?

L0SG · April 11, 2018, 1:56pm

optimizer.add_param_group would be what you want. After you set requires_grad back to True, it adds the previously frozen parameters to the existing optimizer as a new dict in its param_groups list:

# let's unfreeze the fc2 layer this time for extra tuning
net.fc2.weight.requires_grad = True
net.fc2.bias.requires_grad = True

# add the unfrozen fc2 weight and bias to the current optimizer
optimizer.add_param_group({'params': net.fc2.parameters()})
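Here is a rough end-to-end sketch of the two stages, assuming a toy Net class with fc1/fc2 (hypothetical, just to match the names above) and a made-up smaller learning rate for the newly added group. It shows that the optimizer gains a second param group and that fc2 is updated on the next step.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class Net(nn.Module):
    """Hypothetical two-layer network with the fc1/fc2 names used above."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 4)
        self.fc2 = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


net = Net()

# Stage 1: fc2 is frozen and left out of the optimizer.
for param in net.fc2.parameters():
    param.requires_grad = False
optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)
groups_before = len(optimizer.param_groups)

# Stage 2: unfreeze fc2 and register it as a new group with its own learning rate.
for param in net.fc2.parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": net.fc2.parameters(), "lr": 0.01})
groups_after = len(optimizer.param_groups)

# One training step; the fc2 bias gradient is nonzero for a summed loss.
bias_before = net.fc2.bias.detach().clone()
loss = net(torch.randn(8, 4)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
fc2_changed = not torch.equal(bias_before, net.fc2.bias)
```

Keys that are missing from the new group's dict (momentum, weight decay, and so on) fall back to the defaults the optimizer was created with.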

Brando_Miranda · May 1, 2018, 1:32am

if you only do:

# we want to freeze the fc2 layer
net.fc2.weight.requires_grad = False
net.fc2.bias.requires_grad = False

without the second part does it still work? i.e. without:

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

L0SG · May 1, 2018, 3:10am

Yes, it does work when you first add the parameters to the optimizer with requires_grad=True and then set the flag to False afterwards. You can also find out yourself by commenting out

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

in the snippet above, since the previous optimizer still contains all parameters, including those of fc2 with the changed requires_grad flag.

Note that the above snippet assumed a common “train => save => load => freeze parts” scenario.
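A small experiment along those lines, as a sketch only: the two-layer nn.Sequential is a hypothetical stand-in where index 1 plays the role of fc2, and the optimizer is plain SGD on a fresh model. With momentum or weight decay and a stale .grad left over from earlier steps, the behavior may differ, so treat this as a check for the simple case.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in network; index 1 plays the role of fc2.
net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# The optimizer is built over all parameters, before anything is frozen.
optimizer = optim.SGD(net.parameters(), lr=0.1)

# Freeze the last layer afterwards, without rebuilding the optimizer.
for param in net[1].parameters():
    param.requires_grad = False

frozen_before = net[1].bias.detach().clone()
first_before = net[0].bias.detach().clone()

loss = net(torch.randn(8, 4)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The frozen layer never received a gradient, so the optimizer skipped it.
frozen_grad_is_none = net[1].bias.grad is None
frozen_unchanged = torch.equal(frozen_before, net[1].bias)
first_changed = not torch.equal(first_before, net[0].bias)
```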

cysmith · August 24, 2018, 9:36am

Snippet for recursively freezing a portion of your graph.

def dfs_freeze(model):
    for name, child in model.named_children():
        for param in child.parameters():
            param.requires_grad = False
        dfs_freeze(child)

Then:

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=opt.lr, amsgrad=True)
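For what it's worth, child.parameters() already yields the parameters of nested submodules, so freezing a whole subtree can also be written without explicit recursion. The sketch below assumes a PyTorch version that provides nn.Module.requires_grad_; the ModuleDict with "backbone" and "head" is a hypothetical model for illustration.

```python
import torch.nn as nn

# Hypothetical model with a nested backbone and a separate head.
model = nn.ModuleDict({
    "backbone": nn.Sequential(nn.Linear(4, 4), nn.Sequential(nn.Linear(4, 4), nn.ReLU())),
    "head": nn.Linear(4, 2),
})

# Freeze the whole backbone subtree, including the nested Sequential, in one call.
model["backbone"].requires_grad_(False)

frozen = [name for name, p in model.named_parameters() if not p.requires_grad]
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(frozen)
print(trainable)
```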

daisy · September 16, 2018, 3:11am

@L0SG, when I unfreeze the previously frozen layers and also want to change the learning rate and the momentum, the parameters of these layers don’t have ‘momentum_buffer’ in the optimizer’s state. Do you have any suggestions for this issue?
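As far as I can tell, the missing ‘momentum_buffer’ is expected right after add_param_group: SGD creates that state entry lazily, on the first step in which the parameter has a gradient. The sketch below illustrates this with a hypothetical two-layer stand-in and made-up per-group lr and momentum values.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in: body is unfrozen later, head is trained from the start.
net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
body, head = net[0], net[1]

for p in body.parameters():
    p.requires_grad = False
optimizer = optim.SGD(head.parameters(), lr=0.01, momentum=0.9)

# Later: unfreeze the body and give it its own learning rate and momentum.
for p in body.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": body.parameters(), "lr": 0.001, "momentum": 0.5})

# No optimizer state exists yet for the newly added parameters.
has_buffer_before = "momentum_buffer" in optimizer.state.get(body.weight, {})

loss = net(torch.randn(8, 4)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After the first step the buffer has been created for the new group too.
has_buffer_after = "momentum_buffer" in optimizer.state.get(body.weight, {})
```

To change the settings of a group that is already in the optimizer, the values in optimizer.param_groups[i] (for example "lr" or "momentum") can be assigned directly.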

herleeyandi · February 8, 2019, 8:19pm

Hello, I am still confused about freezing weights in PyTorch; it looks very hard to do. Suppose I want to build a loss function that filters the loss using an initialized kernel. I am using nn.Conv2d to do the job, but I don’t want its weight to be updated (I want it frozen). The loss function is basically a simple network: let’s say network A is the main network that will be updated, and network B is the network that computes the loss. For this task I enabled grad for A and disabled it for B. During training my program takes the loss from B, then backpropagates into the main network A (where the weights should be updated). However, I always end up with this:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Does anyone have a better way to do this? Can I just use modelB.eval() or modelB.train(False)?
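A sketch of the A/B setup, assuming two hypothetical single-conv networks. Freezing B's parameters by itself does not break backpropagation into A; one way to get the error above is if A's output has been cut off from the graph (for example with .detach() or inside torch.no_grad()), so that nothing feeding the loss requires grad. Whether that is the cause in the original code is a guess.

```python
import torch
import torch.nn as nn

# Hypothetical networks: A is trained, B only scores A's output.
net_a = nn.Conv2d(1, 1, kernel_size=3, padding=1)
net_b = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

# Freeze B's kernel; eval() alone would not do this.
for param in net_b.parameters():
    param.requires_grad = False

x = torch.randn(2, 1, 8, 8)
out_a = net_a(x)                    # still attached to the graph
loss = net_b(out_a).pow(2).mean()   # gradient flows through B back to A
loss.backward()

a_has_grad = net_a.weight.grad is not None
b_has_grad = net_b.weight.grad is not None

# One way to reproduce the error: detach A's output before passing it to B.
try:
    net_b(out_a.detach()).pow(2).mean().backward()
    detached_fails = False
except RuntimeError:
    detached_fails = True
```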

lugiavn · February 8, 2019, 9:15pm

You can try setting the learning rate for those parameters to 0. The optimizer supports a different learning rate for each param group, and it can be adjusted during training; look that up.
Note that some layers (such as BatchNorm) update their running statistics in the forward pass, so you have to set them to eval() too.
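The BatchNorm point can be seen in a few lines. This is a minimal sketch with a standalone nn.BatchNorm1d layer: requires_grad = False freezes the learnable weight and bias, but the running mean is a buffer and keeps changing in train mode until the layer is switched to eval().

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
for param in bn.parameters():
    param.requires_grad = False   # freezes weight and bias only

# Batch mean is far from the initial running_mean of zeros.
x = torch.randn(16, 3) + 5.0

# In train mode the forward pass alone updates the running statistics.
bn.train()
bn(x)
updated_in_train = not torch.equal(bn.running_mean, torch.zeros(3))

# In eval mode the running statistics are used but left untouched.
bn.eval()
snapshot = bn.running_mean.clone()
bn(x)
unchanged_in_eval = torch.equal(bn.running_mean, snapshot)
```

Keep in mind that calling model.train() on the parent model puts every submodule back into training mode, so the frozen BatchNorm layers need eval() again afterwards.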

rohun · February 24, 2019, 3:49am

If we set the requires_grad to false for a particular layer, do we have to leave it out of the optimizer?
Such as, this ->

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

Is the benefit in the speedup? Or is it wrong if one does not filter it out of the optimizer?
Thanks

L0SG · February 24, 2019, 12:59pm

The error from adding already frozen parameters to the optimizer is an explicit design choice to avoid silly mistakes (one could mistakenly add a frozen parameter and assume that it is being trained). But manually freezing parameters after declaring the optimizer works as intended without re-defining the optimizer (see my own reply above in this thread).

rohun · February 24, 2019, 7:48pm

Thanks for the clarification!
As a follow-up to my point regarding the speed up - I am not observing a speedup when I freeze the initial 80% of the network. I expected the training to be faster, since it only has to update 20% and to be lighter, since it only has to store the information to execute a backward for 20% of the network.
Is there a speedup expected in this scenario?