For freezing certain layers, why do I need a two step process?

I see that almost all responses (tutorials, discussions) on training part of a network include these two steps:

  1. Set target network parameters to requires_grad=False
  2. Pass only non-target parameters to the optimiser

Doing either of the two alone already achieves the effect of not updating the target layers.

My take on the comparison is as follows.

  1. Setting requires_grad=False requires less compute, as the target gradients are not computed.
  2. Excluding the parameters from the optimiser makes the optim.step() call faster, as it does not have to loop over them.
  3. Setting requires_grad=True and only skipping the layers in the optimiser still leads to gradient computation. Maybe an eventual gradient overflow due to accumulation?
  4. Setting requires_grad=False while still passing the parameters to the optimiser has some impact on the optimiser.step() call.
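A minimal sketch of the two-step recipe being discussed (the model and sizes are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

# Toy model; the middle layer is the one we want to freeze.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))

# Step 1: disable gradient computation for the target layer.
for p in model[1].parameters():
    p.requires_grad = False

# Step 2: pass only the still-trainable parameters to the optimiser.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.1
)

model(torch.randn(4, 8)).mean().backward()
optimizer.step()

# The frozen layer received no gradient and was not updated.
print(model[1].weight.grad)  # None
```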

Am I missing anything?

  1. Yes, and this can be verified using this small code snippet:
import torch
import torch.nn as nn


model = nn.Sequential(
        nn.Conv2d(3, 3, 3),
        nn.Conv2d(3, 3, 3),
        nn.Conv2d(3, 3, 3)
).cuda()

#model[1].weight.requires_grad = False
#model[1].bias.requires_grad = False

x = torch.randn(1, 3, 24, 24).cuda()
out = model(x)
out.mean().backward()

In the use case where all layers are trainable, you would see 2 dgrad and 3 wgrad kernels:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     21.2            16544          2    8272.0    8272.0      8256      8288         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
     20.9            16288          3    5429.3    4896.0      4768      6624       1036.6  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…
     19.8            15456          3    5152.0    4928.0      4768      5760        532.6  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…

If you freeze the middle layer, one wgrad kernel will be missing, as its weights do not get any gradient. The dgrad kernel is still called, as the gradient needs to be backpropagated to the first conv layer:

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     22.2            15968          2    7984.0    7984.0      7968      8000         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
     21.7            15552          3    5184.0    4896.0      4768      5888        613.0  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…
     15.2            10944          2    5472.0    5472.0      4736      6208       1040.9  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…

  2. Yes, and it is also more “explicit”, assuming you never want to train these frozen parameters.
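E.g. the freeze is then visible directly on the parameters, independent of how the optimizer was constructed (a small sketch using the same conv setup as above):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
)

# Freeze the middle layer explicitly.
for p in model[1].parameters():
    p.requires_grad = False

# Anyone inspecting the model can see what is trainable without
# looking at the optimizer setup.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(n_trainable, n_total)  # 168 252
```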

  3. Yes. Not only will the gradients be computed, but they will also be accumulated into the .grad attribute, which will launch another kernel.
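This is easy to check: a layer that is merely excluded from the optimizer still gets its .grad attribute populated and accumulated (a sketch; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
# Only the second layer is passed to the optimizer; the first
# still has requires_grad=True.
optimizer = torch.optim.SGD(model[1].parameters(), lr=0.1)

for _ in range(3):
    model(torch.randn(2, 4)).mean().backward()

# Gradients for the excluded layer were computed and accumulated
# over all three backward calls.
print(model[0].weight.grad is not None)  # True
```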

  4. This depends on the actual use case: e.g. whether this (now) frozen parameter was updated before, and whether the optimizer uses internal running stats that were already populated for it. In such a case even a frozen parameter could see an update, if you are setting its gradients to zero (i.e. not using zero_grad(set_to_none=True), which is the default in newer PyTorch releases) and the optimizer can use its internal state to update this parameter.
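A small sketch of that failure mode, assuming Adam and a single parameter: once the momentum buffers are populated, a zeroed (rather than None) gradient no longer prevents an update.

```python
import torch
import torch.nn as nn

param = nn.Parameter(torch.ones(1))
optimizer = torch.optim.Adam([param], lr=0.1)

# Populate Adam's running stats with one real step.
param.grad = torch.ones(1)
optimizer.step()

# Now "freeze" the parameter only by zeroing its gradient
# (what zero_grad(set_to_none=False) would do) ...
frozen_value = param.detach().clone()
param.grad = torch.zeros(1)
optimizer.step()

# ... the parameter still moves, driven by the internal momentum buffers.
print(torch.equal(param.detach(), frozen_value))  # False
```

With param.grad = None instead, the optimizer skips the parameter entirely and no update happens.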
