Freeze weights or constrain the optimizer?


Consider the following 2-layer NN:

model = nn.Sequential(OrderedDict([
          ('1', nn.Linear(3, 3)),
          ('2', nn.Linear(3, 3)),
        ]))

Suppose that I want to freeze the second layer and train only the first layer.
I think there are two ways to achieve this:

  1. Set `requires_grad` of the second layer's parameters to False, then train:
for param in model[1].parameters():
    param.requires_grad = False
optimizer = optim.AdamW(model.parameters(), lr=0.0001)
  2. Constrain the optimizer to the parameters of the first layer:

optimizer = optim.AdamW(model[0].parameters(), lr=0.0001)
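A quick way to convince yourself that the two approaches behave the same is to run one training step with each on identically initialized models and compare the first layer. The seed, batch size, and squared-output dummy loss below are illustrative assumptions, not from the original post:

```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.optim as optim

def make_model():
    # Same seed so both copies start from identical weights.
    torch.manual_seed(0)
    return nn.Sequential(OrderedDict([
        ('1', nn.Linear(3, 3)),
        ('2', nn.Linear(3, 3)),
    ]))

x = torch.randn(8, 3)  # dummy batch

# Method 1: freeze via requires_grad, give the optimizer all parameters.
m1 = make_model()
for p in m1[1].parameters():
    p.requires_grad = False
opt1 = optim.AdamW(m1.parameters(), lr=0.0001)
opt1.zero_grad()
m1(x).pow(2).mean().backward()  # dummy loss
opt1.step()

# Method 2: leave requires_grad alone, give the optimizer only layer 1.
m2 = make_model()
opt2 = optim.AdamW(m2[0].parameters(), lr=0.0001)
opt2.zero_grad()
m2(x).pow(2).mean().backward()
opt2.step()

# First-layer parameters match; the second layer is unchanged in both cases.
assert torch.allclose(m1[0].weight, m2[0].weight)
assert torch.equal(m1[1].weight, m2[1].weight)
```

Note that method 1 works even though the frozen parameters are handed to AdamW: optimizers skip any parameter whose `.grad` is None.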

Both methods yield the same values for the parameters of the first layer.
So which method is preferred? Or should we combine both:

for param in model[1].parameters():
    param.requires_grad = False
optimizer = optim.AdamW(model[0].parameters(), lr=0.0001)
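Put together as a runnable sketch, the combined version looks like this; the seed and the squared-output dummy loss are my own assumptions for the sake of a self-contained check:

```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model = nn.Sequential(OrderedDict([
    ('1', nn.Linear(3, 3)),
    ('2', nn.Linear(3, 3)),
]))

# Freeze the second layer (positional index 1 in the Sequential)...
for param in model[1].parameters():
    param.requires_grad = False

# ...and also restrict the optimizer to the first layer's parameters.
optimizer = optim.AdamW(model[0].parameters(), lr=0.0001)

first_before = model[0].weight.detach().clone()
frozen_before = model[1].weight.detach().clone()

# One training step on a dummy batch with a dummy loss.
x = torch.randn(8, 3)
optimizer.zero_grad()
model(x).pow(2).mean().backward()
optimizer.step()

# The frozen layer is untouched; the first layer has been updated.
assert torch.equal(model[1].weight, frozen_before)
assert not torch.equal(model[0].weight, first_before)
```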

One thing that speaks in favor of explicitly setting requires_grad = False for frozen parameters is reduced overhead. By doing so you tell torch not to compute or accumulate gradients for those leaf tensors during the backward pass, whereas with option 2 they are computed but never used.
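You can observe this directly: after a backward pass, frozen leaves have no `.grad` at all, while gradients still flow *through* the frozen layer to the trainable one. A minimal check (layer names and dummy data are mine):

```python
import torch
import torch.nn as nn

lin1 = nn.Linear(3, 3)   # trainable
lin2 = nn.Linear(3, 3)   # frozen
for p in lin2.parameters():
    p.requires_grad = False

# Backward pass through both layers on a dummy batch.
out = lin2(lin1(torch.randn(4, 3))).sum()
out.backward()

# Gradients exist only for the trainable layer's leaves;
# the frozen leaves were never populated.
assert lin1.weight.grad is not None
assert lin2.weight.grad is None
```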

The autograd documentation is also straightforward on this subject:

> Setting requires_grad should be the main way you control which parts of the model are part of the gradient computation, for example, if you need to freeze parts of your pretrained model during model fine-tuning.