Is it necessary to disable gradient computation in fine-tuning?

Hey.

Generally, when we fine-tune a classifier by keeping a pre-trained model as a feature extractor only, we set requires_grad = False for the pre-trained block and train only the newly added FC layer.
For example, see the code snippet below:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Setting up the model
# Note that the parameters of imported models are set to requires_grad=True by default

res_mod = models.resnet34(pretrained=True)
for param in res_mod.parameters():
    param.requires_grad = False

num_ftrs = res_mod.fc.in_features
res_mod.fc = nn.Linear(num_ftrs, 2)

res_mod = res_mod.to(device)
criterion = nn.CrossEntropyLoss()

# Here's another change: instead of all parameters being optimized
# only the params of the final layers are being optimized

optimizer_ft = optim.SGD(res_mod.fc.parameters(), lr=0.001, momentum=0.9)

exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

My question is: is it necessary to set requires_grad=False when fine-tuning, given that we are anyway specifying in optimizer_ft exactly which parameters need to be updated, i.e. only the last FC layer’s params?

I know it will be a computational disaster and shouldn’t be done this way, but I am just asking out of curiosity.

Are you trying to freeze a pre-trained net and train an FC layer for, say, classification? So the net gives you features it learned elsewhere and the FC layer learns the classification task?
Fine-tuning means re-training a model that is already pre-trained; you should not freeze a model if you want to fine-tune it.

I am asking if it’s “necessary” to do:

for param in res_mod.parameters():
    param.requires_grad = False

if I want to “freeze” a part of a model. As far as I know, if we do this:

optimizer_ft = optim.SGD(res_mod.fc.parameters(), lr=0.001, momentum=0.9)

then we are still only updating the weights of the ‘FC’ layer (or any other layer, for that matter) and not the rest of the model. But yes, I agree it will be computationally expensive. I am asking this just to check whether my understanding of how optimizers in PyTorch work is correct.

Yes - freezing any part of the model requires freezing the grads. Otherwise, every optimizer step you make will change all the weights.

Hi Mrityunjay!

Yes, you are correct. When you call optimizer_ft.step(), the optimizer will only update the weights that were specified in that optimizer.

Even if other weights – not in the optimizer – have had their gradients computed, those other weights will not be updated.
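
As a quick sanity check, a minimal sketch along these lines (a toy two-layer model; the names are purely illustrative) confirms this behavior:

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# toy "backbone" + "fc", stand-ins for the real model
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
backbone, fc = model[0], model[1]

# the optimizer only knows about the fc parameters
opt = optim.SGD(fc.parameters(), lr=0.1)

w_before = backbone.weight.clone()

loss = model(torch.randn(8, 4)).sum()
loss.backward()   # gradients are computed for *all* parameters
opt.step()        # but only fc's parameters get updated

print(backbone.weight.grad is not None)          # True: a grad was computed
print(torch.equal(backbone.weight, w_before))    # True: weight unchanged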

Yes. Just to confirm, computing gradients for other weights not in the optimizer will add (unnecessary) computational cost. You will pay the cost – mostly in memory – of building the “computation graph” for those weights during the forward pass, and pay the cost – mostly computational – of computing their gradients during the backward pass. But then you throw those gradients away, wasting that effort. (Doing so won’t break anything, though, except maybe cause you to run out of memory.)
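
And for contrast, a similar toy sketch of what setting requires_grad = False buys you – with the frozen part excluded from the graph, no gradient is ever computed for it:

import torch
import torch.nn as nn

backbone = nn.Linear(4, 4)
fc = nn.Linear(4, 2)

for p in backbone.parameters():
    p.requires_grad = False   # freeze: autograd records no graph for these

out = fc(backbone(torch.randn(8, 4)))
out.sum().backward()

print(backbone.weight.grad)          # None: no gradient computed (work saved)
print(fc.weight.grad is not None)    # True: fc still gets its gradient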

Hi Oriel!

No, if by “freezing the grads” you mean setting requires_grad = False for them, this is not true. Even if you compute the gradients for various weights (for which requires_grad = True), they will not be updated if they are not “part of” the optimizer on which you call optimizer.step().

For example, it is perfectly possible to use one optimizer, optA, to optimize one part, “part A,” of a model, and use another, optB, to optimize another part, “part B.” You could compute the gradients for all of your weights, both those in “part A” and those in “part B” of your model, with a single backward pass. But when you call optA.step() you will only update the weights in “part A,” leaving “part B” unchanged, and only update the weights in “part B” when you call optB.step().
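
A minimal sketch of that two-optimizer setup (toy “part A” / “part B” modules, purely illustrative):

import torch
import torch.nn as nn
import torch.optim as optim

partA = nn.Linear(4, 4)
partB = nn.Linear(4, 2)

optA = optim.SGD(partA.parameters(), lr=0.1)
optB = optim.SGD(partB.parameters(), lr=0.1)

wB_before = partB.weight.clone()

loss = partB(partA(torch.randn(8, 4))).sum()
loss.backward()   # one backward pass computes gradients for both parts

optA.step()       # updates "part A" only ...
print(torch.equal(partB.weight, wB_before))   # True: "part B" untouched

optB.step()       # ... and this one updates "part B"
print(torch.equal(partB.weight, wB_before))   # False: now it changed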

Best.

K. Frank


Thanks Frank, that’s awesome!