If I freeze the first few consecutive layers by setting requires_grad=False, the autograd engine will exclude those layers from the backward graph, so the computation and runtime during model training will be reduced. However, if I freeze some intermediate layers, the autograd engine will still traverse the full graph, because gradients must flow through the frozen layers to reach the earlier trainable layers. Thus, although the frozen layers don't get updated, the gradients with respect to their weights will still be computed, and there is no runtime saving. Is my understanding correct? How can I save computation in the second case? Is it possible to skip the computation of dL/dW for the frozen layers? The layers I am trying to freeze are Conv2d layers, and the model I am experimenting with is a ResNet.
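To illustrate the first case, here is a minimal toy sketch (the two-layer Sequential and its sizes are made up for illustration, not part of my actual setup): when the leading layer is frozen and the input does not require gradients, backward never computes a gradient for the frozen layer's weights.

```python
import torch
import torch.nn as nn

# Toy two-layer model: freeze only the first (leading) layer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
for p in model[0].parameters():
    p.requires_grad = False

x = torch.randn(2, 4)  # input does not require grad
out = model(x).sum()
out.backward()

# The frozen leading layer accumulates no gradient; the trainable one does.
assert model[0].weight.grad is None
assert model[1].weight.grad is not None
```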
For the first case, here is my code snippet to freeze roughly the first 50 parameter tensors of a resnet34 model:
import torch.optim as optim
import torchvision

model = torchvision.models.resnet34()
# Freeze the leading parameter tensors.
for i, (name, param) in enumerate(model.named_parameters()):
    param.requires_grad = False
    if i > 50:
        break
# Pass only the trainable parameters to the optimizer.
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
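As a sanity check of this pattern, here is a toy sketch where a small Sequential stands in for resnet34 (the layer sizes and the break index are made up for illustration); it confirms that the enumerate/break loop freezes only the leading parameter tensors and that the optimizer sees only the trainable ones.

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-in for resnet34: freeze the first two parameter tensors
# using the same enumerate/break pattern as above.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
for i, (name, param) in enumerate(model.named_parameters()):
    param.requires_grad = False
    if i > 0:
        break

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)
# 6 parameter tensors total (weight + bias per Linear), 2 frozen.
assert len(trainable) == 4
```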
For the second case, here is my code snippet to freeze all convolutional parameters in a resnet34 model:
import torch.optim as optim
import torchvision

model = torchvision.models.resnet34()
# Conv2d weights are 4-D tensors, so check the number of dimensions.
# (Comparing param.shape == 4 is always False, since shape is a size
# object, not an int.)
for name, param in model.named_parameters():
    if param.dim() == 4:
        param.requires_grad = False
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=0.1)
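Here is a minimal toy sketch of the second case (again with a made-up three-layer Sequential rather than the real ResNet): when only a middle layer is frozen, backward still has to pass through it so that the earlier trainable layer can receive gradients, which is why I don't expect backward to stop early here.

```python
import torch
import torch.nn as nn

# Toy three-layer model: freeze only the middle layer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
for p in model[1].parameters():
    p.requires_grad = False

out = model(torch.randn(2, 4)).sum()
out.backward()

# The frozen middle layer gets no weight gradient, but gradients still
# propagate through it to the first (trainable) layer.
assert model[1].weight.grad is None
assert model[0].weight.grad is not None
```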
The training runtime in the first case was reduced significantly compared with full-model training, but in the second case there was no runtime difference from full-model training.