Hi everyone, I have a question regarding BatchNorm2d.
What changes happen in the model if during training I set requires_grad=False on BatchNorm2d layers?
I read that running_mean and running_var are buffers and do not require gradients. Is that true? If so, what will be the difference in BatchNorm2d if I set requires_grad=False as opposed to requires_grad=True?
Yes, that’s true: the running stats are buffers, and they will be updated with the batch statistics in each forward pass as long as the module is in training mode.
By default, batchnorm layers contain trainable parameters (weight and bias), which receive gradients and are thus updated during training. Setting their requires_grad attribute to False freezes these parameters.
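As a toy illustration of what freezing means for the optimizer (plain Python, not PyTorch's actual autograd machinery): a parameter whose requires_grad is False simply never receives an update step, even though gradients for other parameters are applied as usual.

```python
# Toy sketch (not PyTorch): an SGD step skips parameters marked as frozen,
# which is effectively what requires_grad=False achieves for weight/bias.
params = {"weight": 1.0, "bias": 0.0}
frozen = {"weight", "bias"}            # requires_grad=False on both
grads = {"weight": 0.5, "bias": -0.2}  # made-up gradients for illustration
lr = 0.1

for name, g in grads.items():
    if name not in frozen:
        params[name] -= lr * g         # update only unfrozen parameters

print(params)  # {'weight': 1.0, 'bias': 0.0} -- unchanged, i.e. frozen
```

Note that nothing here touches the running stats: freezing is purely about gradient updates.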
However, do these parameters (weight and bias) actually influence the output of the BatchNorm2d layer, or are they just there for consistency among layer implementations?
Looking at the formula, BatchNorm2d seems to require only the running stats and the expected mean/variance, with no weight and bias terms.
Gamma and beta are the weight and bias, respectively.
At training time, each forward pass estimates E[x] and Var[x] from the batch samples.
At test time, calling model.eval() changes the forward behavior to use the running stats instead of the batch E[x] and Var[x].
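To make that concrete, here is a minimal pure-Python sketch of the per-channel batchnorm formula y = gamma * (x - E[x]) / sqrt(Var[x] + eps) + beta (the helper name batchnorm1 is made up for this illustration; it is not a PyTorch function). It shows that gamma and beta very much influence the output:

```python
import math

def batchnorm1(xs, gamma, beta, eps=1e-5):
    # y = gamma * (x - E[x]) / sqrt(Var[x] + eps) + beta
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

xs = [1.0, 3.0]
print(batchnorm1(xs, gamma=1.0, beta=0.0))  # ~[-1, 1]: plain normalization
print(batchnorm1(xs, gamma=2.0, beta=5.0))  # ~[3, 7]: gamma/beta rescale and shift the output
```

With gamma=1 and beta=0 the layer reduces to plain normalization, which is why the formula can look like it has no weight and bias.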
Hi, I fine-tuned a Wide-ResNet, training only the last fully connected layer; all other layers remain the same since I set requires_grad=False for them. The output from a batchnorm layer of the original pre-trained resnet and of the fine-tuned one (only the last FC layer trained) should then be the same for the same input. However, I get different outputs right after the batchnorm layer. Do you know why that is?
Setting requires_grad=False will freeze the trainable affine parameters, but will not change the running stats updates, as explained in my previous post. Call .eval() on these layers to use the fixed running stats instead.
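Here is a toy sketch of just the running-mean bookkeeping (StatsOnlyBN is a made-up class for this post, not a PyTorch module; 0.1 is an assumption matching PyTorch's default momentum). The buffer moves on every forward pass in training mode, regardless of any requires_grad setting, and is left alone after .eval():

```python
class StatsOnlyBN:
    """Toy model of batchnorm's running-mean update, momentum=0.1 assumed."""
    def __init__(self):
        self.running_mean = 0.0
        self.training = True

    def eval(self):
        self.training = False

    def forward(self, xs):
        batch_mean = sum(xs) / len(xs)
        if self.training:
            # This update happens even if weight/bias have requires_grad=False.
            self.running_mean += 0.1 * (batch_mean - self.running_mean)
        # Training normalizes with batch_mean; eval uses the running mean.
        return batch_mean if self.training else self.running_mean

bn = StatsOnlyBN()
bn.forward([10.0, 10.0])  # training mode: running_mean moves 0.0 -> 1.0
bn.eval()
bn.forward([10.0, 10.0])  # eval mode: running_mean stays put
print(bn.running_mean)    # 1.0
```

So a frozen but still-training batchnorm layer keeps drifting its buffers, which is exactly why the fine-tuned model's batchnorm outputs differ from the pre-trained one's.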
Yes, you can call eval() on any layer. Note that it won’t freeze the trainable parameters, but it will change the behavior of some layers, such as batchnorm. Calling eval() on the last linear layer won’t have any effect.
So does this mean that if I set model.eval() at test time (when features are extracted), two models with the same affine parameters but different batchnorm stats should give the same results, since eval() discards E[x] and Var[x] and calculates these stats from the test set?