My network has two models, A and B, chained like this:
```python
a = A(input)
b = B(a)
```
I want to freeze model A and train only model B. Is the following enough to freeze model A, which contains batchnorm and dropout layers?

```python
for param in A.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(B.parameters(), ...)
```
During my tests, I found this line is also needed:

```python
A.eval()
```
When I told my friends about this, we were all confused. We thought the parameters in batchnorm layers would not change if we didn't pass them to an optimizer.
Batchnorm layers behave differently in train and eval mode. What about their parameters? Do they change with each forward pass?
There are three things to batchnorm:
- (Optional) Parameters (weight and bias, aka scale and location, aka gamma and beta) that behave like those of a linear layer, except they are per-channel. These are trained by gradient descent, and disabling gradients prevents them from being updated.
- (Also optional) running mean and variance, which are a form of average over the per-channel batch statistics. These are not parameters but buffers; they do not require grad and are updated during the forward pass whenever batch norm runs in training mode. They do not affect the outputs in training mode, but they do change as you feed data through the layer in training mode.
- In training mode, the batch statistics are computed and each channel is mean/variance standardized. In eval mode, the running mean and variance are used in place of the batch statistics to standardize the input.
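The second point is easy to check directly. A minimal sketch (the layer and batch here are made up for illustration): the running statistics move during a training-mode forward pass even though every parameter has `requires_grad = False`, and stop moving in eval mode.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
for p in bn.parameters():
    p.requires_grad = False  # freezes only gamma/beta, not the buffers

x = torch.randn(8, 3) + 5.0  # a batch with a clearly nonzero mean

before = bn.running_mean.clone()
bn.train()
bn(x)  # forward pass in train mode: buffers are updated
assert not torch.allclose(bn.running_mean, before)

frozen = bn.running_mean.clone()
bn.eval()
bn(x)  # forward pass in eval mode: buffers stay put
assert torch.allclose(bn.running_mean, frozen)
```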
Thanks a lot! Now I know what happened…
Model A was loaded from state_dict_A. Then I trained model B and saved it as new_B.
After that, I loaded state_dict_A and new_B. The predictions changed, because I hadn't completely frozen model A: its batchnorm running statistics were still being updated while I trained B.
I will never forget to call model.eval() when I need to freeze a model.
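Putting the whole thread together, here is a minimal sketch of the full freezing recipe. The concrete modules for A and B are placeholders invented for the example; only the three freezing steps matter.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for models A and B.
A = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(0.5))
B = nn.Linear(4, 2)

# Step 1: stop gradient updates to A's weights.
for param in A.parameters():
    param.requires_grad = False
# Step 2: stop running-stat updates and disable dropout.
A.eval()
# Step 3: give the optimizer only B's parameters.
optimizer = torch.optim.Adam(B.parameters(), lr=1e-3)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

stats_before = A[1].running_mean.clone()
for _ in range(3):
    optimizer.zero_grad()
    a = A(x)
    b = B(a)
    loss = nn.functional.mse_loss(b, target)
    loss.backward()
    optimizer.step()

# A's batchnorm buffers stay fixed thanks to A.eval().
assert torch.allclose(A[1].running_mean, stats_before)
```

Without step 2, the final assertion would fail: training-mode forward passes through A would keep moving the running statistics, which is exactly the bug described above.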