Finetuning Torchvision Models - Tutorial

Hi, I have a question about the official Finetuning Torchvision Models tutorial. In that tutorial, all parameters of the network except the new ones (from the new classifier layer) are set to requires_grad = False, and then this code is used to build the optimizer:

```python
import torch.optim as optim

params_to_update = model_ft.parameters()
print("Params to learn:")
if feature_extract:
    # Feature extraction: collect only the parameters that still require gradients
    params_to_update = []
    for name, param in model_ft.named_parameters():
        if param.requires_grad == True:
            params_to_update.append(param)
            print("\t", name)
else:
    # Finetuning: all parameters are trained, just print their names
    for name, param in model_ft.named_parameters():
        if param.requires_grad == True:
            print("\t", name)

optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)
```

(The else branch here is for finetuning.) But isn't setting requires_grad = False enough to stop these parameters from being trained, even if we pass them to the optimizer? In my understanding of the autograd module, both cases do the same thing (train only the parameters with requires_grad = True), even though they don't pass the same parameters to the optimizer. Am I missing something?

Thank you very much for your help!

If your optimizer uses weight decay, the parameters might be updated even without having a gradient, which is why they are also filtered out of the parameter list passed to the optimizer.
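For example, the filtering can be written compactly like this (a minimal sketch, assuming `model_ft` is set up for feature extraction as in the tutorial):

```python
import torch.optim as optim

# Pass only the parameters that still require gradients to the optimizer,
# so weight decay cannot touch the frozen ones.
params_to_update = [p for p in model_ft.parameters() if p.requires_grad]
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)
```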

Oh right, super interesting, thank you very much for this answer!

On second thought, in this case using weight decay + requires_grad = False while still keeping the parameters in the optimizer (what they did in the finetuning case) implies that the weights of the "frozen" part of the network will shrink at every epoch, even though backprop doesn't change them. Wouldn't that be bad for the network?

I'm not sure if weight decay on frozen parameters is beneficial or not; you would have to run some experiments on it.
Just for the sake of completeness, weight decay won't be applied if the .grad attribute is None, as seen here. If you set the requires_grad attribute to False from the beginning, these parameters won't be updated, as their .grad will never be set to a valid value.

However, if you've applied some training to these parameters before setting requires_grad=False and then only zero out their gradients, weight decay will still manipulate these parameters in the next optimizer steps.
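A minimal sketch of both cases (the layer, seed, and hyperparameters are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(4, 4)
opt = optim.SGD(layer.parameters(), lr=0.1, weight_decay=0.1)

# Case 1: requires_grad=False from the start -> .grad stays None,
# so the optimizer skips this parameter entirely (no weight decay).
layer.weight.requires_grad_(False)
w_before = layer.weight.clone()
opt.step()
print(torch.equal(w_before, layer.weight))  # True: parameter untouched

# Case 2: the parameter was trained before, so .grad is a valid tensor;
# afterwards it is frozen and its gradient is only zeroed (not set to None).
layer.weight.requires_grad_(True)
layer(torch.randn(2, 4)).sum().backward()   # populates .grad
layer.weight.requires_grad_(False)
layer.weight.grad.zero_()                   # zero tensor, but not None
w_before = layer.weight.clone()
opt.step()
print(torch.equal(w_before, layer.weight))  # False: weight decay shrank the weight
```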