I noticed that BatchNorm2d with affine=True uses learnable parameters per channel (64) rather than per input activation (64x32x32). I guess that's intentional, to reduce the number of parameters, or am I missing something?
Number of Input Channels = 64
module.bn1.weight: 0.6789193153381348
module.bn1.weight.size(): torch.Size([64])
module.bn1.weight.Gradients: 2.592428207397461
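A quick check of this (a sketch; the 64-channel / 32x32 sizes are taken from the numbers above) shows that the learnable weight and bias are per-channel, and the same module works for any spatial size:

```python
import torch

bn = torch.nn.BatchNorm2d(64, affine=True)
print(bn.weight.shape)  # torch.Size([64]) -- one gamma per channel
print(bn.bias.shape)    # torch.Size([64]) -- one beta per channel

# The same module accepts any spatial size; the parameters stay per-channel:
for hw in (32, 16):
    y = bn(torch.randn(2, 64, hw, hw))
    print(y.shape)
```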
I have another question: if the learnable parameters are disabled (affine=False), are running_mean and running_var then per channel or per activation? (I guess per channel.)
Per channel. You can tell by the fact that you don’t actually provide dimensions beyond the number of channels to BN, so it is unaware of e.g. the “image” dimensions you feed through it.
If you create a BatchNorm2d for, say, three channels and inspect its state, you also have proof that it holds three-element vectors for all four state items (weight, bias, running_mean, running_var). Part of the beauty of PyTorch is that you can easily poke the modules to see how they behave.
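Such a check might look like this (a minimal sketch, assuming a three-channel BatchNorm2d):

```python
import torch

bn = torch.nn.BatchNorm2d(3)
for name, t in bn.state_dict().items():
    print(name, tuple(t.shape))
# weight, bias, running_mean, and running_var are all shape (3,);
# num_batches_tracked is an extra scalar counter.
```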
Thanks a lot @tom !!
I appreciate your explanation.
To compute the per-channel running_mean, does it take the mean across all activations for that channel at once (over all examples in the batch), or first the mean across examples and then the mean across activations?
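One way to check this empirically (a sketch; momentum=1.0 is set here only so the running mean equals the current batch mean exactly):

```python
import torch

x = torch.randn(8, 3, 4, 4)  # (N, C, H, W)
bn = torch.nn.BatchNorm2d(3, momentum=1.0).train()
bn(x)

# Per-channel mean over all activations of all examples at once:
all_at_once = x.mean(dim=(0, 2, 3))
print(torch.allclose(bn.running_mean, all_at_once))  # True

# Mean per example first, then across examples:
per_example_first = x.mean(dim=(2, 3)).mean(dim=0)
print(torch.allclose(all_at_once, per_example_first))  # True
```

Since every example contributes the same number of activations (H x W) per channel, the two orders give the same result here; BatchNorm effectively averages over the N, H, and W dimensions jointly for each channel.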