I am doing some transfer learning, and set a function inside my NN class as so:
```python
for name, child in self.arch.named_children():
    if name != "_fc":
        for param in child.parameters():
            param.requires_grad = False

for name, param in self.arch.named_parameters():
    if "bn" in name:
        param.requires_grad = True
```
Probably a bit convoluted I know, but I was basically just wanting to freeze everything except all BatchNorms and my classifier.
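For anyone wanting to sanity-check that logic, here's a minimal sketch with a toy stand-in module (the `TinyBackbone` class and its layers are hypothetical, with names chosen to mirror the `_fc`/`bn` naming EfficientNet uses), listing which parameters end up trainable:

```python
import torch.nn as nn

# Toy stand-in for the pretrained backbone -- not the real EfficientNet.
class TinyBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3)
        self.bn1 = nn.BatchNorm2d(8)
        self._fc = nn.Linear(8, 2)

arch = TinyBackbone()

# Same logic as above: freeze every child except the classifier...
for name, child in arch.named_children():
    if name != "_fc":
        for param in child.parameters():
            param.requires_grad = False

# ...then re-enable the BatchNorm parameters.
for name, param in arch.named_parameters():
    if "bn" in name:
        param.requires_grad = True

trainable = [n for n, p in arch.named_parameters() if p.requires_grad]
print(trainable)  # only the bn1 and _fc parameters remain trainable
```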
I instantiate my model like so:
```python
arch = EfficientNet.from_pretrained(param['arch'])
model = Net(arch=arch, n_meta_features=len(meta_features))  # New model for each fold
model = nn.DataParallel(model)
model = model.to(device)
```
I think I must be doing something wrong, however, because I would have expected my optimizer to throw an error, as I was calling it like so: `optim = torch.optim.AdamW(model.parameters(), lr=param['lr'], amsgrad=True)`, and it's my understanding that the optimizer will throw an error if it encounters frozen parameters. I also noticed that my training time (about 1 min 47 sec per epoch) was essentially unchanged, and I would have expected it to train faster.
The gradients of parameters where you have set `requires_grad=False` are simply never populated by autograd, so they remain `None` and the optimizer skips them. An optimizer therefore works with frozen params even when you call `optim = torch.optim.AdamW(model.parameters(), lr=param['lr'], amsgrad=True)`, as essentially those params just have `requires_grad` set to `False`. My suggestion is to unfreeze the params and you will see a difference in training time.
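You can verify this with a minimal sketch (two toy parameters, not the model from the question): the frozen parameter's `.grad` is never populated, and the optimizer step leaves it untouched.

```python
import torch

# One frozen and one trainable parameter, both handed to the optimizer.
w_frozen = torch.nn.Parameter(torch.ones(3), requires_grad=False)
w_train = torch.nn.Parameter(torch.ones(3))

opt = torch.optim.AdamW([w_frozen, w_train], lr=0.1)

loss = (w_train ** 2).sum()  # w_frozen plays no part in the graph
loss.backward()

print(w_frozen.grad)  # None -- nothing for the optimizer to apply
opt.step()
print(w_frozen)       # still all ones, untouched by the step
```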
If the optimizer is smart enough to look at `requires_grad`, why do people pass lambda filters into the optimizer? Why not just pass in `model.parameters()`?
As for my training time: because I was setting all the BatchNorms to be unfrozen, and those are scattered literally all throughout the model, backprop still had to be done through the whole network, even though it wasn't updating most of the parameters, since everything in backpropagation is chained/dependent. When I instead froze everything except the classifier, I did see a difference in speed. And since the actual updates to the weights are super quick, it's the gradient calculations that take all the time; that is why I wasn't seeing a speed increase.
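That chaining can be seen directly in a toy sketch (a hypothetical three-layer stand-in, not the EfficientNet above): even when only an early layer is trainable, the backward pass still has to traverse every frozen layer after it to deliver that layer's gradient.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze everything except the FIRST layer (a stand-in for an early BatchNorm):
for p in net.parameters():
    p.requires_grad = False
for p in net[0].parameters():
    p.requires_grad = True

net(torch.randn(1, 4)).sum().backward()

# The frozen later layers accumulate no .grad of their own...
print(net[2].weight.grad)             # None
# ...but backprop still had to flow through them to reach layer 0:
print(net[0].weight.grad is not None)  # True
```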
So I guess I am still unclear though. Is there any difference or benefit to passing in model.parameters() vs passing in filter(lambda p: p.requires_grad, model.parameters()) when you have parts of your model frozen?
Assuming that you set all the `requires_grad` flags before training starts, there is very little difference in using `filter`.
In the optimizer code, it iterates over each parameter and skips it if the gradient is `None`. If `requires_grad=False`, the parameter's gradient is never populated and thus always remains `None`. For this example, I think `filter` adds very little performance boost.
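For completeness, a small sketch of the two call styles (a toy `nn.Linear`, not the model from the question): both are accepted, and the filtered form simply keeps the frozen parameter out of the optimizer's param groups entirely.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.bias.requires_grad = False  # freeze one of the two parameters

opt_all = torch.optim.AdamW(model.parameters(), lr=1e-3)
opt_filtered = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3
)

# The unfiltered optimizer holds both parameters; the filtered one holds
# only the trainable weight:
print(len(opt_all.param_groups[0]["params"]))       # 2
print(len(opt_filtered.param_groups[0]["params"]))  # 1
```

One practical difference: if you unfreeze parameters later in training, the filtered optimizer will never see them, while the unfiltered one picks them up as soon as they start receiving gradients.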
Thank you, that makes sense. Perhaps the behavior of at least some optimizers was different in past versions? I had read that the optimizer would throw an error if it encountered frozen parameters, which is why I was concerned that perhaps my parameters were not set right.