Finetuning Torchvision Models - Tutorial

Hi, I have a question about the official Finetuning Torchvision Models tutorial. In that tutorial, all parameters of the network except the new ones (from the new classifier layer) are set to requires_grad = False, and then this code is used to build the optimizer:

```python
import torch.optim as optim

params_to_update = model_ft.parameters()
print("Params to learn:")
if feature_extract:
    # Feature extraction: collect only the parameters that still require gradients
    params_to_update = []
    for name, param in model_ft.named_parameters():
        if param.requires_grad == True:
            params_to_update.append(param)
            print("\t", name)
else:
    # Finetuning: all parameters are trained, just print their names
    for name, param in model_ft.named_parameters():
        if param.requires_grad == True:
            print("\t", name)

optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)
```

(The else branch here is for finetuning.) But isn't setting requires_grad = False enough to stop these parameters from being trained, even if we pass them to the optimizer? In my understanding of the autograd module, both cases do the same thing (train only the parameters with requires_grad = True), even though they don't pass the same parameters to the optimizer. Am I missing something?

Thank you very much for your help!

If your optimizer uses weight decay, the parameters might be updated even without having a gradient, which is why they are also filtered out of the parameter list passed to the optimizer.
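For example, the filtering can be written compactly like this (a minimal sketch, assuming `model_ft` is set up for feature extraction as in the tutorial):

```python
import torch.optim as optim

# Pass only the parameters that still require gradients to the optimizer,
# so weight decay cannot touch the frozen ones.
params_to_update = [p for p in model_ft.parameters() if p.requires_grad]
optimizer_ft = optim.SGD(params_to_update, lr=0.001, momentum=0.9)
```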

Oh right, super interesting, thank you very much for this answer!

On second thought, in this case using weight decay + requires_grad = False while still keeping the parameters in the optimizer (what they did in the finetuning case) implies that the weights of the "frozen" part of the network will shrink at every epoch, even though backprop doesn't change them. Wouldn't that be bad for the network?

I'm not sure if weight decay on frozen parameters is beneficial or not; you would have to run some experiments on it.
Just for the sake of completeness, weight decay won't be applied if the .grad attribute is None, as seen here. If you set the requires_grad attribute to False from the beginning, these parameters won't be updated, as their .grad will never be set to a valid value.

However, if you've applied some training to these parameters before setting requires_grad=False and then only zero out their gradients, weight decay will still manipulate these parameters in the next optimizer steps.
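A minimal sketch of both cases (the layer, seed, and hyperparameters are made up for illustration):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(4, 4)
opt = optim.SGD(layer.parameters(), lr=0.1, weight_decay=0.1)

# Case 1: requires_grad=False from the start -> .grad stays None,
# so the optimizer skips this parameter entirely (no weight decay).
layer.weight.requires_grad_(False)
w_before = layer.weight.clone()
opt.step()
print(torch.equal(w_before, layer.weight))  # True: parameter untouched

# Case 2: the parameter was trained before, so .grad is a valid tensor;
# afterwards it is frozen and its gradient is only zeroed (not set to None).
layer.weight.requires_grad_(True)
layer(torch.randn(2, 4)).sum().backward()   # populates .grad
layer.weight.requires_grad_(False)
layer.weight.grad.zero_()                   # zero tensor, but not None
w_before = layer.weight.clone()
opt.step()
print(torch.equal(w_before, layer.weight))  # False: weight decay shrank the weight
```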