What does requires_grad=False do on BatchNorm2d?

Hi everyone, I have a question regarding BatchNorm2d.

What changes in the model if I set requires_grad=False on the BatchNorm2d layers during training?
I read that running_mean and running_var are buffers and do not require gradients. Is that true? If so, what is the difference in BatchNorm2d between setting requires_grad=False and requires_grad=True?

Thanks in advance!

Yes, that’s true. The running stats are buffers and don’t get gradients; they are updated in each forward pass using the batch statistics if the module is in training mode.

By default batchnorm layers will contain trainable parameters (weight and bias), which will get gradients and will thus be updated. Setting their requires_grad attribute to False would freeze these parameters.
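As a minimal, standalone sketch (not tied to any specific model) showing the parameters vs. buffers of a BatchNorm2d layer and how to freeze the affine parameters:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(3)

# weight (gamma) and bias (beta) are trainable parameters and receive gradients
for name, param in bn.named_parameters():
    print(name, param.requires_grad)  # weight True, bias True

# running_mean / running_var / num_batches_tracked are buffers, not parameters
for name, buf in bn.named_buffers():
    print(name, buf.requires_grad)    # always False, buffers never get gradients

# freeze the affine parameters
bn.weight.requires_grad_(False)
bn.bias.requires_grad_(False)
```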


Ok thanks!

However, do these parameters (weight and bias) influence the output of the BatchNorm2d layer, or are they just there for consistency with the other layers’ implementations?
Because, looking at the formula, BatchNorm2d seems to require only the running stats (the expected mean/variance), with no weight and bias.

Thanks in advance!

See the batch norm formula (from the BatchNorm2d docs):

y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta

Beta and gamma are the bias and weight.
At training time, E[x] and Var[x] are estimated from the current batch during the forward pass.
At test time, calling model.eval() changes the forward behavior to use the running stats instead of the batch E[x] and Var[x].
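As a quick check of that formula, here is a minimal sketch comparing nn.BatchNorm2d in training mode against a manual per-channel computation of (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)          # affine=True by default; gamma=1, beta=0 initially
x = torch.randn(8, 3, 4, 4)

out = bn(x)                     # training mode: batch statistics are used

# manual per-channel computation of the same formula
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
manual = (x - mean) / torch.sqrt(var + bn.eps) * gamma + beta

print(torch.allclose(out, manual, atol=1e-6))  # True
```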

As @mMagmer explained, gamma=weight and beta=bias will be used in the default setup unless you are creating the batchnorm layers with affine=False.

Ok now everything is clear, thank you both! @mMagmer @ptrblck

Hi, I fine-tuned a Wide-ResNet, training only the last fully connected layer; all other layers stayed the same, since I set requires_grad=False for them. The output of a batchnorm layer should therefore be identical between the original pre-trained ResNet and the fine-tuned one for the same input. However, I get different outputs right after the batchnorm layer. Do you know why that is?

Setting requires_grad = False will freeze the trainable, affine parameters, but will not stop the running stat updates, as explained in my previous post. Call .eval() on these layers to use the fixed running stats instead.
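To make the difference concrete, a minimal standalone sketch (not your actual model) showing that requires_grad=False alone does not stop the running-stat updates, while .eval() does:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.weight.requires_grad_(False)   # freezes gamma
bn.bias.requires_grad_(False)     # freezes beta

x = torch.randn(8, 3, 4, 4)

before = bn.running_mean.clone()
bn(x)                                         # layer is still in train() mode
print(torch.equal(before, bn.running_mean))   # False: stats were still updated

bn.eval()
before = bn.running_mean.clone()
bn(x)
print(torch.equal(before, bn.running_mean))   # True: stats are now fixed
```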


Can I call .eval() on only selected layers? I still need to train the last FC layer.

Yes, you can call eval() on any layer. Note that it won’t freeze the trainable parameters, but it changes the behavior of some layers, such as batchnorm. Calling eval() on the last linear layer won’t have any effect.
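For example, a minimal sketch with a tiny stand-in network (the layers here are illustrative, not the actual Wide-ResNet) that keeps only the batchnorm layers in eval mode while training the last linear layer:

```python
import torch
import torch.nn as nn

# tiny stand-in network: conv/batchnorm backbone followed by a final linear layer
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

model.train()                        # training behavior for the whole model
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.eval()                     # only batchnorm layers use fixed running stats

# optimize only the last (linear) layer
optimizer = torch.optim.SGD(model[-1].parameters(), lr=1e-3)
```

Note that calling model.train() again (e.g., at the start of every epoch) switches the batchnorm layers back to training mode, so the eval() loop would have to be re-applied afterwards.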

So does this mean that if I set model.eval() at test time (when features are extracted), two models with the same affine parameters but different batchnorm stats should give the same results, since eval discards E[x] and Var[x] and computes these stats from the test set?

No, since during eval the running stats are used to normalize the input activation.
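A minimal sketch illustrating this: two batchnorm layers with identical affine parameters but different running stats produce different outputs in eval mode, because the stored running stats (not the test-batch statistics) are used:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn1 = nn.BatchNorm2d(3)
bn2 = nn.BatchNorm2d(3)
bn2.load_state_dict(bn1.state_dict())     # identical affine parameters and stats

# give bn2 different running stats by feeding it shifted data in train() mode
bn2.train()
bn2(torch.randn(16, 3, 4, 4) * 5 + 2)

bn1.eval()
bn2.eval()
x = torch.randn(8, 3, 4, 4)
print(torch.allclose(bn1(x), bn2(x)))     # False: the running stats differ
```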