Why is it that when I set require_grad = False on all my params, the weights in my network still update?

What I am trying to do right now is to write a multi-layer conv2d encoder and freeze the weights of the earlier layers so they stop updating. This should hopefully give me an effect similar to progressively growing the layers: I can initialize the complete network first without worrying about how to mix, match, and add new layers to the network. So before writing a complex model, I thought I would experiment with freezing a small network for testing purposes. The result is already different from what I expected. I assumed disabling gradients would stop the weights from updating, so the final weights would stay the same, but I was wrong. Below is my testing code.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 4, 1), nn.ReLU()).cuda()

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.001)

for params in encoder.parameters():
    print('params:', params)
    params.require_grad = False  # Freeze all weights

params: [ [[[ 0.1500]]], [[[0.9332]]], [[[-0.1422]]], [[[-0.7685]] ] ....

epochs = 2
target = torch.randn(32, 1, 4, 4).cuda()
for e in range(epochs):
    random_input = torch.randn(32, 1, 4, 4).cuda()
    Y_pred = encoder(random_input)
    loss = criterion(Y_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e % 2 == 0:
        print('[Epoch:{} -- Loss:{:.4f}]'.format(e, loss.item()))

[Epoch:0 -- Loss: 0.8634] # loss getting updated
[Epoch:2 -- Loss: 0.8574]
for params in encoder.parameters():  # looping through the encoder to see if my weights are still the same
    print('params:', params)

params: [ [[[ 0.1433]]], [[[0.9233]]], [[[-0.1333]]], [[[-0.7586]] ] ... # weight value updates?

As you can see, if I print out the params after 2 epochs, my weights still got updated. I would like to cancel those updates and only use these layers as input-conversion layers, like the toRGB or fromRGB layers in the Progressive GAN paper.

Hi,

This looks like L2 regularization or a similar behaviour of the optimizer: all your weights are slightly closer to 0.

After reading your comment I switched from Adam to SGD and set weight decay to 0, which is supposed to stop the L2 regularization, but my weights still update. Am I misunderstanding the concept of freezing weights? Are the weights supposed to be updated but never used, or are their values not supposed to change at all?

We’ve had this discussion a while ago and for investigation of this effect I created a gist showing how optim.Adam updates the parameters even without a gradient once the running estimates were set.
Could you compare your code to the example and make sure no momentum etc. is set?
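For reference, here is a minimal sketch of that effect (a toy single parameter, not the gist itself): after one real Adam step, a second step with a zero gradient still changes the parameter, purely because of the stored running estimates.

import torch

p = torch.nn.Parameter(torch.ones(1))
optimizer = torch.optim.Adam([p], lr=0.1)

# One real update so that Adam's exp_avg / exp_avg_sq running estimates become non-zero.
(p ** 2).sum().backward()
optimizer.step()

# Zero the gradient (a zero tensor, not None) and step again: p still moves,
# because the update is driven by the running estimates, not the current grad.
p.grad.zero_()
before = p.detach().clone()
optimizer.step()
print((p.detach() - before).abs().max())  # > 0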

3 Likes

Hi ptrblck, I thought I'd make it easier by switching to SGD instead, and I have set everything to its default values and made sure momentum is zero. I even restarted my computer just to make sure I have a clean slate. I still have this problem, and I don't think it's just Adam, because SGD also still updates my weights after require_grad = False. Actually, while I was typing this I added a print statement right after require_grad = False, and it prints requires_grad as True?? Did I write my code wrong?

My code:

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 4, 1), nn.Sigmoid())

for params in encoder.parameters():
    params.require_grad = False
    print(params.requires_grad)

The print statement comes after I change require_grad to False, but when I print out the setting I get two True outputs (one for the weight and one for the bias).

requires_grad

you are missing an “s”

1 Like

Oh my god, thank you! No wonder this doesn't work; I had a typo all along. So what is require_grad then? It doesn't throw an error. Thank you so much SimonW, you must be a very handsome person in real life.

It’s just assigning a new attribute. For Python objects, if you do a.b = c, a.b doesn’t have to exist before this.
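For illustration, a minimal sketch of what that means for a Parameter (toy example, not from the post above):

import torch
import torch.nn as nn

p = nn.Parameter(torch.ones(1))

p.require_grad = False   # typo: this just attaches a brand-new Python attribute to the object
print(p.requires_grad)   # True  -> the real flag is untouched, so p still gets gradients

p.requires_grad = False  # the actual autograd flag
print(p.requires_grad)   # False -> autograd will now ignore p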

1 Like

Thank you, I am self-taught and this was a good mistake to learn from.

Hi @ptrblck ,
I was wondering if you finally found a workaround to this problem. I am doing some transfer learning where I initialize a model with the parameters of another previously trained model and freeze the last few fully connected layers. Here is how I typically do it:

for name, param in self.model.named_parameters():
    # tells whether we want to use gradients for a given parameter
    if freeze:
        param.requires_grad = False
        print("Freezing parameter " + name)

Transfer looks fine and parameters are initially identical, but when I compare the respective min,max,mean and std values for each parameter of each layer in both models, some of the frozen instances start to vary after a few epochs. See below a case where I froze group1-4:
The model I am transferring the weights from:

Name                            Min          Max          Mean           Std  
----------------------  -----------  -----------  ------------  ------------  
module.group1.0.weight  -0.13601      0.135853     0.000328239    0.078537    
module.group1.0.bias    -0.129506     0.13031     -0.000709442    0.0761818   
module.group2.0.weight  -0.0156249    0.015625    -6.8284e-06     0.00902701  
module.group2.0.bias    -0.0150966    0.0152359    0.000233364    0.00887235  
module.group3.0.weight  -0.0110485    0.0110485    8.25103e-06    0.00637962  
module.group3.0.bias    -0.0109931    0.0109642   -0.000212902    0.00620885  
module.group4.0.weight  -0.0078125    0.0078125   -1.07069e-06    0.00451099  
module.group4.0.bias    -0.0077329    0.00775102  -0.000157763    0.00451984  
module.fc1.0.weight     -0.00195312   0.00195312  -1.05901e-08    0.00112767  
module.fc1.0.bias       -0.00195279   0.0019521    5.93193e-05    0.00113513  
module.fc2.0.weight     -0.0312486    0.0312499   -2.94543e-05    0.0180225   
module.fc2.0.bias       -0.0312394    0.0289709   -0.00238465     0.0186226   
module.fc3.0.weight     -0.100976     0.0989116   -0.00164936     0.0606025   
module.fc3.0.bias       -0.059265    -0.059265    -0.059265     nan           

The model that I initialized through TL:

Name                            Min          Max          Mean           Std  
----------------------  -----------  -----------  ------------  ------------  
module.group1.0.weight  -0.136078     0.136051     0.00138295     0.0788667   
module.group1.0.bias    -0.135537     0.135878     0.00912299     0.0691942   
module.group2.0.weight  -0.0156247    0.0156249   -2.81046e-05    0.00902321  
module.group2.0.bias    -0.0151269    0.0152803    0.000945539    0.0088397   
module.group3.0.weight  -0.0110485    0.0110485   -7.81598e-06    0.00637801  
module.group3.0.bias    -0.0110323    0.0109976   -0.000282283    0.00675859  
module.group4.0.weight  -0.0078125    0.0078125   -8.4189e-07     0.00451147  
module.group4.0.bias    -0.00777942   0.00779467  -2.26952e-05    0.00466924  
module.fc1.0.weight     -0.00195312   0.00195312   1.48078e-07    0.00112768  
module.fc1.0.bias       -0.00194499   0.00195289   5.32243e-05    0.00112104  
module.fc2.0.weight     -0.0312488    0.0312494   -5.54657e-06    0.0180232   
module.fc2.0.bias       -0.0304042    0.0306912    0.00134896     0.018436    
module.fc3.0.weight     -0.0996469    0.101409    -0.00436459     0.0568807   
module.fc3.0.bias       -0.0561954   -0.0561954   -0.0561954    nan     

Any insight would be appreciated!

Thanks,

Are you freezing the parameters from the beginning, and are you using e.g. weight decay?
If so, could you pass only the parameters which require grads to the optimizer and run the code again?

I am actually freezing them from the beginning and I do use weight decay.
I believe I am already passing only the parameters that require grads to the optimizer. See below:

self.optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.model.parameters()), lr=self.learning_rate, weight_decay=self.penalty)
2 Likes

Your approach looks alright.
Could you post a minimal code snippet to reproduce the issue?

Hi Julien,

Any luck or insights on this? I have a similar issue: I use transfer learning, freeze some layers, and the weights of those frozen layers still get updated.

Were these parameters trained before and are you using an optimizer with internal states, e.g. Adam?
If so, note that the running internal states might still update the frozen parameters, as seen in this code snippet:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Setup
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.enc = nn.Linear(64, 10)
        self.dec = nn.Linear(10, 64)
        
    def forward(self, x):
        x = F.relu(self.enc(x))
        x = self.dec(x)

        return x


x = torch.randn(1, 64)
y = x.clone()
model = MyModel()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1.)

# dummy updates
for idx in range(10):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    print('Iter{}, loss {}'.format(idx, loss.item()))

optimizer.zero_grad()
# Freeze encoder
for param in model.enc.parameters():
    param.requires_grad_(False)

# Store reference parameter
enc_weight0 = model.enc.weight.clone()

# Update for more iterations
for idx in range(10):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()
    print('Iter{}, loss {}'.format(idx, loss.item()))
    print('max abs diff in enc.weight {}'.format(
        (enc_weight0 - model.enc.weight).abs().max()))
    print('sum abs grad in enc.weight {}'.format(
        model.enc.weight.grad.abs().sum()))
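As a follow-up, a minimal sketch of one way to avoid these state-driven updates (assuming you are free to recreate the optimizer): after freezing the encoder as above, rebuild the optimizer over the parameters that still require gradients, so Adam's internal state can never touch the frozen ones.

# After the requires_grad_(False) loop above, hand a fresh optimizer only the trainable parameters.
optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1.)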
1 Like

Hi @ptrblck
I’m using pre-trained Places365-resnet50 as a base model and added a new fc layer. Only the newly added fc layer is trained to classify sun attributes. So in one pass I can predict both places 365 categories and sun attributes.

Here is my model:

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import models, transforms
from PIL import Image

# the architecture to use
arch = 'resnet50'

# load the pre-trained weights
model_file = '%s_places365.pth.tar' % arch
if not os.access(model_file, os.W_OK):
    weight_url = 'http://places2.csail.mit.edu/models_places365/' + '%s_places365.pth.tar' % arch
    os.system('wget ' + weight_url)
model = models.__dict__[arch](num_classes=365)
checkpoint = torch.load(model_file, map_location=lambda storage, loc: storage)
state_dict = {str.replace(k, 'module.', ''): v for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)

class CustomizedResNet(nn.Module):

    def __init__(self):
        super(CustomizedResNet, self).__init__()

        # Resnet 50 as base model
        self.base_model = model

        def hook_feature(module, input, output):
            self.feature = output

        self.base_model._modules.get('avgpool').register_forward_hook(hook_feature)

        self.scene_attr_fc = nn.Linear(2048, 102)

        # freeze weights
        for param in self.base_model.parameters():
            param.requires_grad = False

        for param in self.scene_attr_fc.parameters():
            param.requires_grad = True

    def forward(self, x):

        places365_output = self.base_model(x)

        # compute scene attributes
        # feed the outputs from avgpool to the new fc layer
        attributes_output = self.feature.view(self.feature.size(0), -1)
        attributes_output = self.scene_attr_fc(attributes_output)

        return places365_output, attributes_output

customized_model = CustomizedResNet()

And I only pass the parameters of the new fc layer to the optimizer.

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = optim.SGD(customized_model.scene_attr_fc.parameters(), lr=learning_rate)

My training parts look like this:

torch.save(customized_model.state_dict(), 'before_training.pth')

for epoch in range(num_epochs):

    customized_model.train()
    for i, (inputs, labels) in enumerate(dataloader):
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        with torch.set_grad_enabled(True):
            _, scene_attr_outputs = customized_model(inputs)
            loss = criterion(scene_attr_outputs, labels)
            loss.backward()
            optimizer.step()

    torch.save(customized_model.state_dict(), 'model_saved_at_epoch_%s.pth' % epoch)

And my testing parts look like this:

data_transforms = {
    'test': transforms.Compose([
        transforms.Resize((224,224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
}

customized_model.load_state_dict(torch.load('model_saved_at_epoch_9.pth'))
customized_model.eval()

images_test = ['stone.jpg']

for img_path in images_test:
    # load test image
    img = Image.open(img_path).convert('RGB')
    img = data_transforms['test'](img)
    img = img.to(device)

    # prediction
    places365_outputs, scene_attr_outputs = customized_model.forward(img.unsqueeze(0))

    # prediction for places365
    # print(places365_outputs.shape) -> torch.Size([1, 365])
    h_x = F.softmax(places365_outputs, 1).data.squeeze()
    probs, idx = h_x.sort(0, True)
    print('places 365 prediction on {}'.format(img_path))
    for i in range(0, 5):
        # classes stores all 365 labels
        print('{:.3f} -> {}'.format(probs[i], classes[idx[i]]))

The problem is, if I use the models saved at different epochs to predict the same image, the prediction for Places365 changes even though I already froze all weights of the Places365 branch.
For example:
If I use the model saved before any training happens, its prediction is

places 365 prediction on stone.jpg
0.298 -> coast
0.291 -> ocean
0.172 -> beach
0.067 -> ice_floe
0.051 -> sky

If I use model_saved_at_epoch_3.pth, it gives

places 365 prediction on stone.jpg
0.308 -> coast
0.259 -> ocean
0.132 -> beach
0.107 -> sky
0.041 -> cliff

model_saved_at_epoch_13.pth gives:

places 365 prediction on stone.jpg
0.295 -> coast
0.234 -> ocean
0.152 -> sky
0.111 -> beach
0.047 -> cliff

I even compared the weights of the base_model after every epoch to the original weights, and it looks like the weights didn’t change:

# before training occurs
original_weights = []
for name, param in customized_model.base_model.named_parameters():
    original_weights.append(param.clone())

for epoch in range(num_epochs):
    # training .....
    max_abs_diff_sum = 0
    idx = 0
    for epoch_name, epoch_param in customized_model.base_model.named_parameters():
        max_abs_diff_sum += (original_weights[idx] - epoch_param).abs().max()
        idx += 1
    print(max_abs_diff_sum)  # all print tensor(0., device='cuda:0'), so I think the weights of base_model didn't change

Do you have any idea why the probability distribution is changing even though I froze all weights for the Places365 branch (and it also looks like the base_model weights are the same)?
Thank you.

If you are using batch norm layers in the base model, the running estimates will still be updated even if you’ve frozen the affine parameters.
To fix the running stats, you would have to call .eval() on all batch norm layers.
Also, dropout layers might still be active, which could explain the different results.
You could also call .eval() on all dropout layers or, alternatively, on self.base_model to disable these effects.
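For illustration, a minimal sketch of that idea (the helper name is my own, not from the original answer): put every batch norm and dropout layer of the base model into eval mode, and call this again after every customized_model.train().

import torch.nn as nn

def freeze_running_stats(module):
    # Put batch norm and dropout layers into eval mode so running_mean / running_var
    # (and dropout masks) stop changing while the rest of the model stays in train().
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.Dropout)):
            m.eval()

# e.g. in the training loop, right after customized_model.train():
# freeze_running_stats(customized_model.base_model)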

3 Likes

Fixed! Thank you so much!

1 Like

Hi, I found this both interesting and crucial. I want to share my experience here as well and hope it is useful.

Let us assume we have a variable B. I have seen cases where I explicitly set B.require_grad = False but B is still updated during fine-tuning. This could happen even when I put the operations involving B inside torch.no_grad().

My solution is to modify the declaration of B in the code, something like this:

self.register_buffer("B", torch.ones(shape_of_B))  # shape_of_B is a placeholder; buffers do not require grad and are never updated by the optimizer

With that when I load a pretrained model I noticed B is loaded properly. More importantly, when I fine-tune the model I noticed B is indeed frozen. This is a bit manual but it works.