How do you freeze some layers of a network in PyTorch and train only the rest?

I’m afraid I have no definitive answer for this since I don’t know your exact model setup, but here are several suggestions:

  1. Every tensor that comes before the frozen part in the computational graph must also have requires_grad=False, so that the frozen subgraph is excluded by the autograd engine. If any earlier tensor requires grad, the backward pass still has to run through that part of the graph anyway.
  2. I would check which part of your model is the major speed bottleneck. For example, the unfrozen parts may contain parameters that require heavy lifting, like this example. Or the bottleneck may be unrelated to the model itself, such as data loading that overshadows the model’s speed.
  3. Make sure you call torch.cuda.synchronize() so that you measure the speed properly (see the sketch after this list).
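For point 3, here is a minimal timing sketch; the model and input shapes are just placeholders, and it assumes a CUDA device is available:

import time
import torch

model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

torch.cuda.synchronize()   # wait for pending kernels before starting the clock
start = time.time()
for _ in range(100):
    out = model(inputs)
torch.cuda.synchronize()   # wait for the queued kernels to finish before stopping the clock
print('avg forward time: {:.3f} ms'.format((time.time() - start) / 100 * 1e3))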

Have you solved this problem?


@lugiavn’s method should work.
As long as you need to compute d(B)/d(params of A), you have to backpropagate gradients along the paths through B back to A, so the requires_grad attribute has to stay set for B’s computations. Writing a custom backward function for B may be a more efficient way, since B stays constant when none of its parameters are being updated.
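To illustrate the custom-backward idea, here is a minimal sketch for the simplest case where the frozen part B is a single linear map; FrozenLinearFn is a hypothetical name, and a real B would need its own hand-written backward:

import torch

class FrozenLinearFn(torch.autograd.Function):
    # y = x @ W.T with W treated as a constant: no gradient is ever computed for W

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        (weight,) = ctx.saved_tensors
        # only the gradient w.r.t. the input is propagated back towards A;
        # returning None for the weight skips its gradient entirely
        return grad_output @ weight, None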

I was just experimenting with several things, and I found that network B’s weights should have requires_grad = False while the tensors flowing through it should keep requires_grad = True, and the optimizer should only optimize network A’s parameters. It works fine for me. If you disable grad somewhere, you need to enable it again before calculating the loss, because backpropagation needs the gradient history. I don’t know whether this is the exact solution, but it works fine for me and does exactly what I want.
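A rough sketch of that setup (netA and netB are just placeholder models): B’s weights are frozen, gradients still flow through B back to A, and the optimizer only sees A’s parameters.

import torch
from torch import nn, optim

netA = nn.Linear(10, 10)   # trainable part
netB = nn.Linear(10, 1)    # frozen part

for p in netB.parameters():         # freeze B's weights
    p.requires_grad = False

optimizer = optim.Adam(netA.parameters(), lr=1e-3)   # optimize A only

x = torch.randn(4, 10)
target = torch.randn(4, 1)

out = netB(netA(x))                 # the graph through B is kept, so gradients reach A
loss = nn.functional.mse_loss(out, target)
loss.backward()                     # fills .grad only for A's parameters
optimizer.step()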

Thanks for your detailed explanation.

But this does not quite make sense to me.
Could you explain further the difference between the network weights’ requires_grad and a tensor’s requires_grad?

Suppose we have a part of the network, B, that we want to freeze.

for param in networkB.conv1.parameters():
    param.requires_grad = False

For a tensor, we can set it while creating the tensor; you can see the details here.
x = torch.tensor([1], requires_grad=True)


So is it the same thing if we don’t use the filter? I mean, the output will be the same, right?

This works only for freezing layers after they were initially unfrozen. If you want to go the other way around, you’d have to use the optimizer’s add_param_group() method, since the optimizer will not contain the previously frozen parameters initially.
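A rough sketch of that approach, assuming a placeholder backbone that starts out frozen and is unfrozen later:

import torch
from torch import nn, optim

backbone = nn.Linear(10, 10)
head = nn.Linear(10, 1)

for p in backbone.parameters():     # backbone starts frozen
    p.requires_grad = False

# the optimizer initially only knows about the head
optimizer = optim.Adam(head.parameters(), lr=1e-3)

# ... later, when you want to start training the backbone as well ...
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': backbone.parameters(), 'lr': 1e-4})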

I took your code snippet and modified it a little. I am trying to freeze the first three layers for the initial 5 epochs and then train the complete model. But the model is not behaving as expected when training the complete model for the remaining epochs. Do I need to reinitialize the optimizer after 5 epochs, or am I missing something else?

Note: I have studied the add_param_group option, but do you think that will be feasible if I’m training very big models, e.g. a pretrained ResNet152 as the encoder in an encoder-decoder model?

Thanks in advance.

import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# toy feed-forward net
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.fc1 = nn.Linear(10, 3)
        self.fc2 = nn.Linear(3, 3)
        self.fc3 = nn.Linear(3, 3)        
        self.fc4 = nn.Linear(3, 3)
        self.fc5 = nn.Linear(3, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        return x


net = Net()
# print the pre-trained fc2 weight
print('fc2 pretrained weight')
print(net.fc2.weight)

# define new random data
random_input = Variable(torch.randn(10,))
random_target = Variable(torch.randn(1,))

# loss
criterion = nn.MSELoss()

# NOTE: pytorch optimizer explicitly accepts parameter that requires grad
# see https://github.com/pytorch/pytorch/issues/679
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)
# this raises ValueError: optimizing a parameter that doesn't require gradients
#optimizer = optim.Adam(net.parameters(), lr=0.1)

for epoch in range(1,10):
    net.zero_grad()
    count = 0
    if epoch < 5:
        # freeze backbone layers
        for param in net.children():
            count +=1
            if count < 4: #freezing first 3 layers
                param.requires_grad = False
            
    else:
        for param in net.children():
            param.requires_grad = True
            
#    for param in net.children():
#        print(param,param.requires_grad)
        
    print('trainable parameters', count_parameters(net))
    output = net(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()
    print('fc2 weight at epoch:', epoch)
    print(net.fc2.weight) 

Output:

fc2 pretrained weight
Parameter containing:
tensor([[ 0.5127,  0.1465, -0.5701],
        [-0.3253, -0.1051, -0.3173],
        [-0.0262,  0.2804, -0.0923]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 1
Parameter containing:
tensor([[ 0.4127,  0.2465, -0.6701],
        [-0.4253, -0.0051, -0.4173],
        [-0.1262,  0.3804, -0.1923]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 2
Parameter containing:
tensor([[ 0.3130,  0.3466, -0.7702],
        [-0.3617, -0.0704, -0.3515],
        [-0.0610,  0.3138, -0.1251]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 3
Parameter containing:
tensor([[ 0.2345,  0.4255, -0.8493],
        [-0.3122, -0.1212, -0.3002],
        [-0.0103,  0.2619, -0.0729]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 4
Parameter containing:
tensor([[ 0.1776,  0.4843, -0.9030],
        [-0.2739, -0.1612, -0.2610],
        [ 0.0299,  0.2205, -0.0319]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 5
Parameter containing:
tensor([[ 0.1263,  0.5314, -0.9565],
        [-0.2419, -0.1954, -0.2290],
        [ 0.0637,  0.1853,  0.0025]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 6
Parameter containing:
tensor([[ 0.0908,  0.5378, -1.0103],
        [-0.2141, -0.2259, -0.2017],
        [ 0.0928,  0.1551,  0.0321]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 7
Parameter containing:
tensor([[ 0.0888,  0.5036, -1.0591],
        [-0.1893, -0.2535, -0.1778],
        [ 0.1180,  0.1293,  0.0580]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 8
Parameter containing:
tensor([[ 0.1142,  0.4492, -1.1039],
        [-0.1671, -0.2785, -0.1568],
        [ 0.1398,  0.1071,  0.0808]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 9
Parameter containing:
tensor([[ 0.1440,  0.3941, -1.1453],
        [-0.1473, -0.3008, -0.1382],
        [ 0.1591,  0.0876,  0.1011]], requires_grad=True)

Got the solution: for the freezing loops, replace net.children() with net.parameters(). Now I just want to know the difference between the children() and parameters() methods.

    if epoch < 5:
        # freeze backbone layers
        for param in net.parameters():
            count +=1
            if count < 4: #freezing first 3 layers
                param.requires_grad = False
            
    else:
        for param in net.parameters():
            param.requires_grad = True
            
#    for param in net.parameters():
#        print(param, param.requires_grad)

@Mughees

parameters() yields the learnable parameters (weights, biases) of each layer (e.g. a Conv layer).

children() returns the modules inside the network.

So with children(), it should have worked like this:

for layer in model.children():
    for parameter in layer.parameters():
        parameter.requires_grad = True

Thanks a lot for the explanation.

I guess we can simply avoid the nested loop by looping directly over net.parameters().


Yes, I just wanted to clarify the implementation. :wink:


Does changing the requires_grad property of the network require a reinitialization of the optimizer?

You don’t need to reinitialize; you just have to apply the requires_grad filter when defining the optimizer.
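For reference, the filter pattern looks roughly like this (the model here is just a placeholder):

import torch.nn as nn
import torch.optim as optim

net = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))
net[0].requires_grad_(False)        # e.g. freeze the first layer

# hand only the currently trainable parameters to the optimizer
optimizer = optim.Adam((p for p in net.parameters() if p.requires_grad), lr=1e-3)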


This took me a very long time to figure out, as it is very poorly documented, but there is a viable way to dynamically freeze/unfreeze parts of your network without needing to reinitialize your optimizer. It is pretty straightforward for single GPU/CPU training: parameter.requires_grad = False does the job regardless of the state of your optimizer.
However, for distributed training it’s more tricky. The way to do it is as follows:

...
# Assuming you've already initialized your optimizer WITHOUT the requires_grad filter
# => all parameters are included in the optimizer

model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    find_unused_parameters=True,)

for parameter in model.parameters():
    parameter.requires_grad = False
    # You can replace model.parameters() with
    # model.part.that.you.want.to.freeze.parameters()

The find_unused_parameters flag is very important when wrapping your model in DDP. Without this flag, you’ll receive an error when updating your weights. Also, there’s a catch: this method of dynamically freezing/unfreezing your network does not work with PyTorch versions prior to roughly 1.1–1.2 (I don’t remember exactly).

Hope that helps!


If I want to freeze the backbone but not the head that is being attached - does this code work?

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet34(pretrained=True)
        for param in self.model.parameters():
            param.requires_grad = False
        self.model.fc = nn.Sequential(
            nn.Linear(512, 256),
            nn.Dropout(0.2),
            nn.ReLU(inplace=True),
            nn.Linear(256, len(id2int))
        )
        self.loss_fn = nn.CrossEntropyLoss()

Hi @Yeshwanth_Reddy, it wouldn’t be great to do the freezing inside the __init__ function, but it will work just fine outside. It would also be best to declare your loss function outside the Classifier class.

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet34(pretrained=True)
        self.model.fc = nn.Sequential(
            nn.Linear(512, 256),
            nn.Dropout(0.2),
            nn.ReLU(inplace=True),
            nn.Linear(256, len(id2int))
        )

    def forward(self, x):
        return self.model(x)


def train():
    net = Classifier()
    # freeze the backbone, but keep the newly attached head (model.fc) trainable
    for param in net.model.parameters():
        param.requires_grad = False
    for param in net.model.fc.parameters():
        param.requires_grad = True

    criterion = nn.CrossEntropyLoss()
    for batch in dataloader():
        # Code.
        loss.backward()

If during training I set requires_grad = True for a frozen layer, will the optimizer take the new requires_grad into effect without re-initialization? I just checked, and it doesn’t take the updated requires_grad into account.

You can call the torch.nn.Module.requires_grad_() method on the corresponding modules when necessary.

import torch
import torch.nn as nn

model = nn.Conv2d(...)

# freeze model
model.requires_grad_(False)

...
# unfreeze model
model.requires_grad_(True)