Why does BatchNorm2d have learnable parameters per channel rather than per activation?

I noticed that BatchNorm2d with affine=True uses learnable parameters per channel (64) rather than per input activation (64x32x32). I guess that's intentional to reduce the number of parameters, but am I missing something?

Number of Input Channels = 64
module.bn1.weight: 0.6789193153381348
module.bn1.weight.size(): torch.Size([64])
module.bn1.weight.Gradients: 2.592428207397461

module.bn1.bias.: 0.7608728408813477
module.bn1.bias.size(): torch.Size([64])
module.bn1.bias.Gradients: 1.4683836698532104

I am using the following ResNet model from -->
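
For reference, the shapes above can be reproduced with a small standalone check; the 16x64x32x32 input below is just an assumed example:

import torch

bn = torch.nn.BatchNorm2d(64, affine=True)
x = torch.randn(16, 64, 32, 32)        # assumed batch of 16 with 64x32x32 activations
out = bn(x)
print(bn.weight.size())                # torch.Size([64]) -- one scale per channel
print(bn.bias.size())                  # torch.Size([64]) -- one shift per channel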

That is intentional. Batch norm seems to mean different things to different people when it comes to the specifics…
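
For 2d inputs the per-channel scale and shift are broadcast over the spatial dimensions, so every position of a channel shares the same pair of parameters. A rough sketch of that broadcasting (illustrative only, not the actual kernel):

import torch

# every spatial position of channel c shares the same weight[c] and bias[c]
x_hat = torch.randn(16, 64, 32, 32)    # pretend this is the already-normalized input
weight = torch.ones(64)                # one learnable scale per channel
bias = torch.zeros(64)                 # one learnable shift per channel
y = weight.view(1, 64, 1, 1) * x_hat + bias.view(1, 64, 1, 1)
print(y.shape)                         # torch.Size([16, 64, 32, 32])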

Best regards

Thomas


Thanks @tom

I have another question: if the learnable parameters are disabled (affine=False), are the running_mean and running_var per channel or per activation? (I guess per channel.)

Per channel. You can tell by the fact that you don’t actually provide dimensions beyond the number of channels to BN, so it is unaware of e.g. the “image” dimensions you feed through it.
If you do

import torch
bn = torch.nn.BatchNorm2d(3, affine=True)
print(bn.state_dict())

you also have proof that it has three-element vectors for all four state items (recent versions also store a scalar num_batches_tracked counter). Part of the beauty of PyTorch is that you can easily poke the modules to see how they behave.
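
On recent PyTorch versions the same check can be made explicit with a quick loop over the state_dict (a sketch, extending the snippet above):

import torch

bn = torch.nn.BatchNorm2d(3, affine=True)
for name, tensor in bn.state_dict().items():
    print(name, tuple(tensor.shape))
# weight (3,), bias (3,), running_mean (3,), running_var (3,), num_batches_tracked ()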

Best regards

Thomas


Thanks a lot @tom !!
I appreciate your explanation.

To compute the per-channel running_mean, does it take the mean over all activations of that channel at once (across all examples and spatial positions), or does it first average across examples and then across activations?
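
For what it's worth, with a fixed spatial size the two orderings give the same per-channel result, which a quick check illustrates (the 8x64x32x32 tensor below is just an assumed example):

import torch

x = torch.randn(8, 64, 32, 32)                 # assumed batch: 8 examples, 64 channels, 32x32 maps

# (a) mean over all activations of each channel at once (examples and spatial positions together)
mean_all = x.mean(dim=(0, 2, 3))               # shape [64]

# (b) first average across examples, then across spatial positions
mean_two_step = x.mean(dim=0).mean(dim=(1, 2)) # shape [64]

print(torch.allclose(mean_all, mean_two_step)) # True -- both orderings agree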