For freezing certain layers, why do I need a two step process?

I see that almost all responses (tutorials, discussions) on training part of a network include these two steps:

  1. Set target network parameters to requires_grad=False
  2. Pass only non-target parameters to the optimiser

Doing either of the two alone already achieves the effect of not updating the target layers.

My take on the comparison is as follows.

  1. Setting requires_grad=False requires less compute, as the target gradients are not computed.
  2. Excluding the parameters from the optimiser makes the optim.step() call faster, as it does not have to loop over them.
  3. Setting requires_grad=True and only skipping the layers in the optimiser still leads to gradient computation. Maybe an eventual gradient overflow due to accumulation?
  4. Setting requires_grad=False while still passing the parameters to the optimiser has some impact on the optimiser.step() call.
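A minimal sketch of the two-step recipe being discussed (the model and sizes are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

# Toy model; the middle layer is the one we want to freeze.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))

# Step 1: disable gradient computation for the target layer.
for p in model[1].parameters():
    p.requires_grad = False

# Step 2: pass only the still-trainable parameters to the optimiser.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

model(torch.randn(4, 8)).mean().backward()
optimizer.step()

# The frozen layer received no gradient and was not updated.
print(model[1].weight.grad)  # None
```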

Am I missing anything?

  1. Yes, and this can be verified using this small code snippet:
import torch
import torch.nn as nn


model = nn.Sequential(
        nn.Conv2d(3, 3, 3),
        nn.Conv2d(3, 3, 3),
        nn.Conv2d(3, 3, 3)
).cuda()

#model[1].weight.requires_grad = False
#model[1].bias.requires_grad = False

x = torch.randn(1, 3, 24, 24).cuda()
out = model(x)
out.mean().backward()

In the use case where all layers are trainable, you would see 2 dgrad and 3 wgrad kernels:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     21.2            16544          2    8272.0    8272.0      8256      8288         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
     20.9            16288          3    5429.3    4896.0      4768      6624       1036.6  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…
     19.8            15456          3    5152.0    4928.0      4768      5760        532.6  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…

If you freeze the middle layer, one wgrad kernel will be missing, as its weights do not get any gradient. The dgrad kernel is still called, as the gradient needs to be backpropagated to the first conv layer:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     22.2            15968          2    7984.0    7984.0      7968      8000         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
     21.7            15552          3    5184.0    4896.0      4768      5888        613.0  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…
     15.2            10944          2    5472.0    5472.0      4736      6208       1040.9  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…

  2. Yes, and it is also more “explicit”, assuming you never want to train these frozen parameters.
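E.g. the freeze is then visible directly on the parameters, independent of how the optimizer was constructed (a small sketch using the same conv setup as above):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
)

# Freeze the middle layer explicitly.
for p in model[1].parameters():
    p.requires_grad = False

# Anyone inspecting the model can see what is trainable without
# looking at the optimizer setup.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(n_trainable, n_total)  # 168 252
```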

  3. Yes. Not only will the gradients be computed, but they will also be accumulated into the .grad attribute, which will launch another kernel.
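This is easy to check: a layer that is merely excluded from the optimizer still gets its .grad attribute populated and accumulated (a sketch; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
# Only the second layer is passed to the optimizer; the first
# still has requires_grad=True.
optimizer = torch.optim.SGD(model[1].parameters(), lr=0.1)

for _ in range(3):
    model(torch.randn(2, 4)).mean().backward()

# Gradients for the excluded layer were computed and accumulated
# over all three backward calls.
print(model[0].weight.grad is not None)  # True
```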

  4. This depends on the actual use case: e.g. whether this (now) frozen parameter was updated before, and whether the optimizer uses internal running stats that were already populated for it. In such a case even a frozen parameter could see an update, if you are setting its gradients to zero (i.e. not using zero_grad(set_to_none=True), which is the default in newer PyTorch releases) and the optimizer can use its internal state to update this parameter.
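A small sketch of that failure mode, assuming Adam and a single parameter: once the momentum buffers are populated, a zeroed (rather than None) gradient no longer prevents an update.

```python
import torch
import torch.nn as nn

param = nn.Parameter(torch.ones(1))
optimizer = torch.optim.Adam([param], lr=0.1)

# Populate Adam's running stats with one real step.
param.grad = torch.ones(1)
optimizer.step()

# Now "freeze" the parameter only by zeroing its gradient
# (what zero_grad(set_to_none=False) would do) ...
frozen_value = param.detach().clone()
param.grad = torch.zeros(1)
optimizer.step()

# ... the parameter still moves, driven by the internal momentum buffers.
print(torch.equal(param.detach(), frozen_value))  # False
```

With param.grad = None instead, the optimizer skips the parameter entirely and no update happens.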
