Is it necessary to disable gradient computation in fine-tuning?

Hey.

Generally, when we fine-tune a classifier by keeping a pre-trained model as a feature extractor only, we set requires_grad = False for the pre-trained block and train only the newly added FC layer.
For example, see the code snippet below:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Setting up the model
# Note that the parameters of imported models are set to requires_grad=True by default

res_mod = models.resnet34(pretrained=True)
for param in res_mod.parameters():
    param.requires_grad = False

# Replace the final FC layer with a new one for 2 classes;
# newly constructed modules have requires_grad=True by default
num_ftrs = res_mod.fc.in_features
res_mod.fc = nn.Linear(num_ftrs, 2)

res_mod = res_mod.to(device)
criterion = nn.CrossEntropyLoss()

# Here's another change: instead of all parameters being optimized,
# only the params of the final layer are being optimized

optimizer_ft = optim.SGD(res_mod.fc.parameters(), lr=0.001, momentum=0.9)

# Decay the learning rate by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

My question is: is it necessary to set requires_grad=False when fine-tuning, given that we are anyway passing only the parameters that need to be updated, i.e. the last FC layer's params, to optimizer_ft?

I know it will be a computational disaster and shouldn’t be done this way, but I am just asking out of curiosity.
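To make the question concrete, here is a rough sketch of the situation I mean (just an illustration with dummy inputs, not an actual training run): requires_grad is left at its default of True everywhere, and only the fc parameters are handed to the optimizer, yet backward() still computes gradients for the whole backbone.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # new head, 2 classes

# Nothing is frozen, but only the fc parameters are given to the optimizer
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One dummy forward/backward pass
inputs = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4,))
loss = criterion(model(inputs), labels)
loss.backward()

# Autograd still computes and stores gradients for the whole backbone,
# even though the optimizer will never use them
print(model.conv1.weight.grad is not None)  # True -> the "wasted" computation
print(model.fc.weight.grad is not None)     # True -> the gradient that is actually used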

Are you trying to freeze a pre-trained net and train an FC layer for, say, classification? So the net gives you the features it learned elsewhere and the FC layer learns the classification task?
Fine-tuning means re-training a model that is already pre-trained; you should not freeze a model if you want to fine-tune it.

I am asking if it's “necessary” to do:

for param in res_mod.parameters():
    param.requires_grad = False

if I want to “freeze” a part of a model. As far as I know, if we do this:

optimizer_ft = optim.SGD(res_mod.fc.parameters(), lr=0.001, momentum=0.9)

then we are still only updating the weights of the ‘FC’ layer (or any other layer, for that matter) and not the rest of the model. But yes, I agree it will be computationally expensive. I am asking this just to check whether my understanding of how optimizers in PyTorch work is correct.
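For reference, this is the kind of check I have in mind (a minimal sketch with dummy data, assuming the un-frozen setup above): optimizer.step() should leave the backbone weights untouched because those parameters were never passed to the optimizer, even though their gradients are computed during backward().

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the fc parameters are handed to the optimizer; nothing is frozen
optimizer = optim.SGD(model.fc.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Snapshot a backbone weight before the update
before = model.conv1.weight.detach().clone()

inputs = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4,))
criterion(model(inputs), labels).backward()
optimizer.step()

# step() only touches the parameters it was constructed with,
# so the backbone weight is unchanged
print(torch.equal(before, model.conv1.weight))  # True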