- Yes, and this can be verified using this small code snippet:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3)
).cuda()

# uncomment to freeze the middle conv layer
#model[1].weight.requires_grad = False
#model[1].bias.requires_grad = False

x = torch.randn(1, 3, 24, 24).cuda()
out = model(x)
out.mean().backward()
```
In the use case where all layers are trainable you would see 2 dgrad and 3 wgrad kernels:
```
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    21.2            16544          2    8272.0    8272.0      8256      8288         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
    20.9            16288          3    5429.3    4896.0      4768      6624       1036.6  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…
    19.8            15456          3    5152.0    4928.0      4768      5760        532.6  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…
```
If you freeze the middle layer, one wgrad kernel will be missing, since the frozen weights do not receive a gradient. The dgrad kernel is still called, as the gradient needs to be backpropagated to the first conv layer:
```
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
    22.2            15968          2    7984.0    7984.0      7968      8000         22.6  void cudnn::cnn::dgrad2d_grouped_direct_kernel<float, int, float, float, (bool)0, (bool)1, (int)0, …
    21.7            15552          3    5184.0    4896.0      4768      5888        613.0  void implicit_convolve_sgemm<float, float, (int)128, (int)5, (int)5, (int)3, (int)3, (int)3, (int)1…
    15.2            10944          2    5472.0    5472.0      4736      6208       1040.9  void cudnn::cnn::wgrad2d_grouped_direct_kernel<(bool)0, (bool)1, int, float, float, float>(cudnn::c…
```
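The same effect is also visible without a profiler by inspecting the `.grad` attributes directly. A minimal sketch of the frozen-middle-layer case (run on the CPU here, since autograd behaves the same):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3),
    nn.Conv2d(3, 3, 3)
)
# freeze the middle conv layer
model[1].weight.requires_grad = False
model[1].bias.requires_grad = False

out = model(torch.randn(1, 3, 24, 24))
out.mean().backward()

print(model[0].weight.grad is not None)  # True: the gradient flowed through the frozen layer
print(model[1].weight.grad)              # None: the frozen layer got no weight gradient
print(model[2].weight.grad is not None)  # True
```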
- Yes, and it is also more “explicit”, assuming you never want to train these frozen parameters.
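A common complementary pattern (not required, but it makes the intent visible at the optimizer level as well) is to pass only the trainable parameters to the optimizer. A small sketch, assuming the frozen parameters already have `requires_grad=False`:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
# freeze the first linear layer
model[0].weight.requires_grad = False
model[0].bias.requires_grad = False

# hand only the parameters that still require gradients to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2)
print(len(trainable))  # 2: the weight and bias of the second linear layer
```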
- Yes. Not only will the gradients be computed, but they will also be accumulated into the .grad attribute, which will launch another kernel.
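The accumulation can be seen directly: calling backward twice on the same input without zeroing the gradients doubles the stored gradient rather than overwriting it. A small sketch:

```python
import torch
import torch.nn as nn

lin = nn.Linear(2, 2)
x = torch.randn(4, 2)

lin(x).mean().backward()
g1 = lin.weight.grad.clone()

lin(x).mean().backward()  # the new gradient is accumulated into .grad, not overwritten
print(torch.allclose(lin.weight.grad, 2 * g1))  # True
```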
- This depends on the actual use case: e.g. whether this (now) frozen parameter was updated before, whether the optimizer uses internal running stats, and thus whether these stats were already populated for the frozen parameter, etc. In such a case even frozen parameters could see an update, if you are setting the gradients to zero (i.e. you are not using zero_grad(set_to_none=True), which is the default in newer PyTorch releases) and the optimizer can use its internal state to update the parameter.
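As a concrete illustration of this edge case, here is a sketch using Adam: after a few regular steps its running averages are populated, so a later step with zeroed (not None) gradients still moves the “frozen” parameter through the momentum term:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(2, 2)
optimizer = torch.optim.Adam(lin.parameters(), lr=1e-1)

# take a few steps so Adam populates its internal running stats
for _ in range(3):
    lin(torch.randn(4, 2)).mean().backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=False)  # grads become zero tensors, not None

# "freeze" the weight and check whether it still moves
lin.weight.requires_grad = False
ref = lin.weight.clone()
optimizer.step()  # grads are zero, but Adam's exp_avg still produces an update
print(torch.equal(ref, lin.weight))  # False: the frozen weight was updated
```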