If I freeze the first few consecutive layers by setting requires_grad=False, the autograd engine will exclude those layers from the backward graph, so the computation and runtime during model training will be reduced. However, if I freeze some intermediate layers, the autograd engine will still traverse the full graph, because gradients must flow through the frozen layers to reach the earlier trainable layers. Thus, although the frozen layers don't get updated, the gradients with respect to their weights will still be computed, and there is no runtime saving. Is my understanding correct? How can I save computation in the second case? Is it possible to skip the computation of dL/dW for the frozen layers? The layers I am trying to freeze are Conv2d layers, and the model I am experimenting with is a ResNet.
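To illustrate the first case, here is a minimal toy sketch (the two-layer Sequential and its sizes are made up for illustration, not part of my actual setup): when the leading layer is frozen and the input does not require gradients, backward never computes a gradient for the frozen layer's weights.

```python
import torch
import torch.nn as nn

# Toy two-layer model: freeze only the first (leading) layer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
for p in model[0].parameters():
    p.requires_grad = False

x = torch.randn(2, 4)  # input does not require grad
out = model(x).sum()
out.backward()

# The frozen leading layer accumulates no gradient; the trainable one does.
assert model[0].weight.grad is None
assert model[1].weight.grad is not None
```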
For the first case, here is my code snippet to freeze roughly the first 50 parameter tensors of a resnet34 model:
import torch.optim as optim
import torchvision

model = torchvision.models.resnet34()
# Freeze the leading parameter tensors.
for i, (name, param) in enumerate(model.named_parameters()):
    param.requires_grad = False
    if i > 50:
        break
# Pass only the trainable parameters to the optimizer.
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
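As a sanity check of this pattern, here is a toy sketch where a small Sequential stands in for resnet34 (the layer sizes and the break index are made up for illustration); it confirms that the enumerate/break loop freezes only the leading parameter tensors and that the optimizer sees only the trainable ones.

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-in for resnet34: freeze the first two parameter tensors
# using the same enumerate/break pattern as above.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
for i, (name, param) in enumerate(model.named_parameters()):
    param.requires_grad = False
    if i > 0:
        break

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)
# 6 parameter tensors total (weight + bias per Linear), 2 frozen.
assert len(trainable) == 4
```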
For the second case, here is my code snippet to freeze all convolutional parameters in a resnet34 model:
import torch.optim as optim
import torchvision

model = torchvision.models.resnet34()
# Conv2d weights are 4-D tensors, so check the number of dimensions.
# (Comparing param.shape == 4 is always False, since shape is a size
# object, not an int.)
for name, param in model.named_parameters():
    if param.dim() == 4:
        param.requires_grad = False
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
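Here is a minimal toy sketch of the second case (again with a made-up three-layer Sequential rather than the real ResNet): when only a middle layer is frozen, backward still has to pass through it so that the earlier trainable layer can receive gradients, which is why I don't expect backward to stop early here.

```python
import torch
import torch.nn as nn

# Toy three-layer model: freeze only the middle layer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
for p in model[1].parameters():
    p.requires_grad = False

out = model(torch.randn(2, 4)).sum()
out.backward()

# The frozen middle layer gets no weight gradient, but gradients still
# propagate through it to the first (trainable) layer.
assert model[1].weight.grad is None
assert model[0].weight.grad is not None
```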
The training runtime in the first case was reduced significantly compared with full-model training, but in the second case there was no runtime difference from full-model training.