How do I freeze some layers of a network in PyTorch and train only the rest?

Yes, it works: you can add the parameters to the optimizer while requires_grad=True and set the flag to False afterwards. You can verify this yourself by commenting out

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

in the snippet above; the previous optimizer still contains all parameters, including those of fc2 whose requires_grad flag was changed.

Note that the above snippet assumes a common “train => save => load => freeze parts” scenario.

Snippet for recursively freezing a portion of your graph.

def dfs_freeze(model):
    for name, child in model.named_children():
        for param in child.parameters():
            param.requires_grad = False
        dfs_freeze(child)

Then:

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=opt.lr, amsgrad=True)
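As a rough illustration of the two steps above, here is a minimal sketch that freezes one sub-module of a toy model and passes only the remaining trainable parameters to the optimizer. The `ToyModel` class, its layer sizes, and the learning rate are made up for the example. Since `parameters()` already walks nested sub-modules, the sketch uses a plain loop rather than an explicit recursion.

```python
import torch
from torch import nn

class ToyModel(nn.Module):
    # Hypothetical model: a "backbone" to freeze and a "head" to train.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 4))
        self.head = nn.Linear(4, 1)

    def forward(self, x):
        return self.head(self.backbone(x))

def freeze(module):
    # parameters() already includes the parameters of all nested sub-modules.
    for param in module.parameters():
        param.requires_grad = False

model = ToyModel()
freeze(model.backbone)

# Hand only the parameters that still require grad to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3, amsgrad=True)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {n_trainable} / {n_total}")
```

Counting the trainable parameters this way is a quick sanity check that the intended part of the model was actually frozen.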

@L0SG, when I unfreeze the previously frozen layers and also want to change the learning rate and momentum, the parameters of these layers have no ‘momentum_buffer’ entry in the optimizer’s state. Do you have any suggestions for this issue?

class YourModel(nn.Module):
    
    def __init__(self, in_features, out_features):
        super(YourModel, self).__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)

model = YourModel(in_features, out_features)
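# Note: train(False) is equivalent to eval(); it only changes the behavior of
# layers such as Dropout and BatchNorm and does not by itself stop weight updates.
model.fc.train(False)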

Hello, I am still confused about how to freeze weights in PyTorch; it seems very hard to do. Suppose I want to build a loss function that filters the loss using an initialized kernel. I am using nn.Conv2d for this, but I don’t want its weights to be updated (they should be frozen). The loss function is basically a simple network: say network A is the main network that will be updated, and network B is the network that computes the loss. For this task I enabled grad for A and disabled it for B. During training, my program takes the loss from B and backpropagates it into the main network A (whose weights should be updated). However, I always end up with this:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Does anyone have a better way to do this? Can I just use modelB.eval() or modelB.train(False)?
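One way this error can arise (an assumption about the setup, since the original code is not shown) is when the loss is computed with gradient tracking switched off, for example inside `torch.no_grad()` or after a `detach()`. The sketch below uses two made-up one-layer networks, `net_a` and `net_b`, to reproduce it.

```python
import torch
from torch import nn

net_a = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # stands in for the main network A
net_b = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # stands in for the loss network B

x = torch.randn(1, 1, 8, 8)

# Computing the loss without gradient tracking cuts the graph back to A.
with torch.no_grad():
    loss = net_b(net_a(x)).mean()

error_message = None
try:
    loss.backward()
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

If the error goes away once the `torch.no_grad()` block is removed, the graph between A and the loss was being cut somewhere along the way.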


You can try setting the learning rate for those parameters to 0. Optimizers support a different learning rate for each parameter group, and it can be adjusted during training; look that up.
Note that some layers (such as BatchNorm) update their statistics during the forward pass, so you have to set them to eval() too.
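A minimal sketch of this suggestion, assuming a made-up model split into a `features` part and a `classifier` part (both names and all sizes are hypothetical):

```python
import torch
from torch import nn

# Hypothetical model: a conv + BatchNorm "features" part and a linear "classifier".
features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.BatchNorm2d(4))
classifier = nn.Linear(4, 2)

optimizer = torch.optim.SGD(
    [
        {"params": features.parameters(), "lr": 0.0},   # effectively frozen
        {"params": classifier.parameters()},            # uses the default lr below
    ],
    lr=0.1,
)

# In train mode BatchNorm updates its running statistics in the forward pass,
# independently of the optimizer, so switch that part to eval().
features.eval()

before = features[0].weight.detach().clone()
running_mean_before = features[1].running_mean.clone()

x = torch.randn(5, 3, 6, 6)
out = classifier(features(x).mean(dim=(2, 3)))
out.sum().backward()
optimizer.step()

# The learning rate of a group can be changed later, e.g. to unfreeze it.
optimizer.param_groups[0]["lr"] = 0.01
```

Unlike requires_grad=False, this approach still computes gradients for the zero-learning-rate group, so it should not be expected to make the backward pass any faster.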


If we set requires_grad to False for a particular layer, do we have to leave it out of the optimizer?
For example:

optimizer = optim.SGD(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)

Is the benefit a speedup? Or is it wrong not to filter it out of the optimizer?
Thanks

The error raised when adding already-frozen parameters to the optimizer is an explicit design choice to avoid silly mistakes (one could mistakenly add a frozen parameter and assume that it is being trained). Manually freezing parameters after declaring the optimizer, however, works as intended without re-defining the optimizer (see my own reply above in this thread).
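To check the second point yourself, here is a small sketch with a made-up two-layer network: the optimizer is created first, one layer is frozen afterwards, and the frozen weights are compared before and after a step.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))

# The optimizer is created while every parameter still requires grad.
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Freeze the first layer afterwards, without touching the optimizer.
for param in net[0].parameters():
    param.requires_grad = False

frozen_before = net[0].weight.detach().clone()

loss = net(torch.ones(2, 4)).pow(2).sum()
loss.backward()
optimizer.step()

# Frozen parameters received no gradient, so step() skipped them.
print(net[0].weight.grad)                          # None
print(torch.equal(frozen_before, net[0].weight))   # True
```

This sketch freezes the layer before the first backward pass. If a parameter already holds a gradient or momentum state from earlier steps, whether it stays fixed can depend on the optimizer settings and on how gradients are cleared, so that case is worth checking separately.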

Thanks for the clarification!
As a follow-up to my point regarding the speed up - I am not observing a speedup when I freeze the initial 80% of the network. I expected the training to be faster, since it only has to update 20% and to be lighter, since it only has to store the information to execute a backward for 20% of the network.
Is there a speedup expected in this scenario?

I’m afraid I have no definitive answer for this since I don’t know your exact model setup, but here are several suggestions:

  1. Every single tensor before the frozen part in the computational graph must also have requires_grad=False so that the frozen subgraph is excluded by the autograd engine. If any tensor there requires grad, the backward pass will have to run through that part of the graph anyway.
  2. I would check which part of your model is the major bottleneck in speed. For example, the unfrozen parts may contain the parameters that require the heavy lifting. Or the bottleneck may be unrelated to the model itself, such as data loading that overshadows the model speed.
  3. Make sure that you use torch.cuda.synchronize() to measure the speed properly.
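The measurement advice above could be sketched roughly as follows. `timed_step` is a hypothetical helper, the model is a made-up stack of linear layers, and on such a tiny model the numbers will be dominated by overhead, so treat it only as a template.

```python
import time
import torch
from torch import nn

def timed_step(model, x, n_iters=10):
    """Average seconds per forward/backward pass (hypothetical helper)."""
    use_cuda = x.is_cuda
    if use_cuda:
        torch.cuda.synchronize()   # wait for pending kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        model.zero_grad()
        model(x).sum().backward()
    if use_cuda:
        torch.cuda.synchronize()   # wait for the queued GPU work to finish
    return (time.perf_counter() - start) / n_iters

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)
).to(device)
x = torch.randn(32, 64, device=device)   # the input itself does not require grad

t_full = timed_step(model, x)

# Freeze everything except the last layer. Nothing before it requires grad,
# so autograd can skip the backward pass through the frozen prefix.
for layer in list(model)[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

t_frozen = timed_step(model, x)
print(f"full: {t_full:.6f}s  frozen prefix: {t_frozen:.6f}s")
```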

Have you solved this problem?

@lugiavn’s method should work.
As long as you need to compute d(B)/d(params of A), you have to backpropagate gradients along the paths through B to A. The requires_grad attribute has to be set for B. Writing a custom backward function for B may be a more efficient way; it will be constant, as there is no parameter updating.

I have been experimenting with several things and found that the weights of network B should have requires_grad = False, but the tensors passing through it should always have requires_grad = True. For the optimizer, just optimize network A’s parameters. If you disable grad, you need to enable it again before calculating the loss, because backpropagation needs the gradient history. I don’t know whether this is the exact solution, but it works fine for me and does exactly what I want.
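A minimal sketch of that arrangement, assuming two made-up single-layer networks `net_a` and `net_b` and random data: B's weights are frozen, its forward pass is left inside the autograd graph, and only A is given to the optimizer.

```python
import torch
from torch import nn

net_a = nn.Conv2d(1, 1, kernel_size=3, padding=1)               # main network A (trained)
net_b = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)   # fixed-kernel loss network B

# Freeze B's weights only; do not wrap B's forward pass in torch.no_grad().
for param in net_b.parameters():
    param.requires_grad = False
net_b.eval()

optimizer = torch.optim.SGD(net_a.parameters(), lr=0.01)   # only A is optimized

x = torch.randn(2, 1, 8, 8)
target = torch.randn(2, 1, 8, 8)

prediction = net_a(x)   # requires grad because A's weights do
loss = (net_b(prediction) - net_b(target)).pow(2).mean()

optimizer.zero_grad()
loss.backward()          # gradients flow through B into A
optimizer.step()

print(net_a.weight.grad is not None)   # True
print(net_b.weight.grad)               # None
```

The intermediate tensor `prediction` requires grad automatically because it was produced from A's trainable weights; it does not need to be flagged by hand.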

Thanks for your detailed explanation.

But this does not make sense to me.
Could you explain further the difference between a network weight’s requires_grad and a tensor’s requires_grad?

Suppose we have a part of network B that we want to freeze.

for param in networkB.conv1.parameters():
    param.requires_grad = False

For a tensor, we can set it when creating the tensor (note that only floating-point tensors can require gradients):
x = torch.tensor([1.0], requires_grad=True)


So is it the same thing if we don’t use the filter? I mean, the output will be the same, right?

This works only for freezing layers that were initially unfrozen. If you want to go the other way around, you’d have to use the add_param_group method of the Optimizer, as the optimizer will not contain the previously frozen parameters initially.
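A rough sketch of the "other way around" case, using a made-up two-layer network and arbitrary learning rates: the first layer starts frozen and outside the optimizer, then is unfrozen and registered as a new parameter group.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))

# Start with the first layer frozen and left out of the optimizer.
for param in net[0].parameters():
    param.requires_grad = False
optimizer = torch.optim.SGD(
    [p for p in net.parameters() if p.requires_grad], lr=0.1, momentum=0.9
)

# ... train the last layer for a while ...

# Unfreeze the first layer and register it as a new parameter group.
for param in net[0].parameters():
    param.requires_grad = True
optimizer.add_param_group({"params": list(net[0].parameters()), "lr": 0.01})

net(torch.ones(2, 4)).pow(2).sum().backward()
optimizer.step()

print(len(optimizer.param_groups))   # 2
```

Options not given for the new group, such as momentum here, are filled in from the optimizer's defaults, and the per-parameter state for the newly added parameters is created when they are first updated.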

I took your code snippet and modified it a little. I am trying to freeze the first three layers for the initial 5 epochs and then train the complete model. But the freezing is not working as expected: the complete model is being trained in all epochs. Do I need to reinitialize the optimizer after 5 epochs, or am I missing something else?

Note: I have studied the add_param_group option, but do you think it will be feasible if I’m training very big models, such as a pretrained ResNet152 used as the encoder in an encoder-decoder model?

Thanks in advance.

import torch
from torch import nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# toy feed-forward net
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.fc1 = nn.Linear(10, 3)
        self.fc2 = nn.Linear(3, 3)
        self.fc3 = nn.Linear(3, 3)        
        self.fc4 = nn.Linear(3, 3)
        self.fc5 = nn.Linear(3, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        x = self.fc4(x)
        x = self.fc5(x)
        return x


net = Net()
# print the pre-trained fc2 weight
print('fc2 pretrained weight')
print(net.fc2.weight)

# define new random data
random_input = Variable(torch.randn(10,))
random_target = Variable(torch.randn(1,))

# loss
criterion = nn.MSELoss()

# NOTE: pytorch optimizer explicitly accepts parameter that requires grad
# see https://github.com/pytorch/pytorch/issues/679
optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)
# this raises ValueError: optimizing a parameter that doesn't require gradients
#optimizer = optim.Adam(net.parameters(), lr=0.1)

for epoch in range(1,10):
    net.zero_grad()
    count = 0
    if epoch < 5:
        # freeze backbone layers
        for param in net.children():
            count +=1
            if count < 4: #freezing first 3 layers
                param.requires_grad = False
            
    else:
        for param in net.children():
            param.requires_grad = True
            
#    for param in net.children():
#        print(param,param.requires_grad)
        
    print('trainable parameters', count_parameters(net))
    output = net(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()
    print('fc2 weight at epoch:', epoch)
    print(net.fc2.weight) 

Output:

fc2 pretrained weight
Parameter containing:
tensor([[ 0.5127,  0.1465, -0.5701],
        [-0.3253, -0.1051, -0.3173],
        [-0.0262,  0.2804, -0.0923]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 1
Parameter containing:
tensor([[ 0.4127,  0.2465, -0.6701],
        [-0.4253, -0.0051, -0.4173],
        [-0.1262,  0.3804, -0.1923]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 2
Parameter containing:
tensor([[ 0.3130,  0.3466, -0.7702],
        [-0.3617, -0.0704, -0.3515],
        [-0.0610,  0.3138, -0.1251]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 3
Parameter containing:
tensor([[ 0.2345,  0.4255, -0.8493],
        [-0.3122, -0.1212, -0.3002],
        [-0.0103,  0.2619, -0.0729]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 4
Parameter containing:
tensor([[ 0.1776,  0.4843, -0.9030],
        [-0.2739, -0.1612, -0.2610],
        [ 0.0299,  0.2205, -0.0319]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 5
Parameter containing:
tensor([[ 0.1263,  0.5314, -0.9565],
        [-0.2419, -0.1954, -0.2290],
        [ 0.0637,  0.1853,  0.0025]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 6
Parameter containing:
tensor([[ 0.0908,  0.5378, -1.0103],
        [-0.2141, -0.2259, -0.2017],
        [ 0.0928,  0.1551,  0.0321]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 7
Parameter containing:
tensor([[ 0.0888,  0.5036, -1.0591],
        [-0.1893, -0.2535, -0.1778],
        [ 0.1180,  0.1293,  0.0580]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 8
Parameter containing:
tensor([[ 0.1142,  0.4492, -1.1039],
        [-0.1671, -0.2785, -0.1568],
        [ 0.1398,  0.1071,  0.0808]], requires_grad=True)
trainable parameters 73
fc2 weight at epoch: 9
Parameter containing:
tensor([[ 0.1440,  0.3941, -1.1453],
        [-0.1473, -0.3008, -0.1382],
        [ 0.1591,  0.0876,  0.1011]], requires_grad=True)

Got the solution: in the freezing code, replace net.children() with net.parameters(). Now I just want to know the difference between the children() and parameters() methods.

    if epoch < 5:
        # freeze backbone layers
        for param in net.parameters():
            count +=1
            if count < 4:  # freezes the first 3 parameter tensors (fc1.weight, fc1.bias, fc2.weight)
                param.requires_grad = False
            
    else:
        for param in net.parameters():
            param.requires_grad = True
            
#    for param in net.parameters():
#        print(param, param.requires_grad)

@Mughees

net.parameters() yields the learnable parameters (weights, biases) of each layer (e.g. Conv).

net.children() returns the modules inside the network.

Your original loop should have been written like this:

for layer in model.children():
    for parameter in layer.parameters():
        parameter.requires_grad = True
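To make the difference concrete, here is a small sketch on a made-up two-layer network. It shows what each method yields and why setting requires_grad on a module (as in the children() loop earlier in the thread) has no effect on its weights.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(10, 3), nn.Linear(3, 1))

# children() yields sub-modules; parameters() yields weight and bias tensors.
print([type(m).__name__ for m in net.children()])   # ['Linear', 'Linear']
print([tuple(p.shape) for p in net.parameters()])   # [(3, 10), (3,), (1, 3), (1,)]

# Setting requires_grad on a module only creates a plain Python attribute;
# the parameters inside it are unaffected.
first_layer = next(net.children())
first_layer.requires_grad = False
print(first_layer.weight.requires_grad)              # True

# Freezing has to go through the parameters themselves.
for param in first_layer.parameters():
    param.requires_grad = False
print(first_layer.weight.requires_grad)              # False
```

Recent PyTorch versions also provide `module.requires_grad_(False)`, which applies the same change to every parameter of a module in one call.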