Why does BatchNorm2d have learnable parameters per channel rather than per activation?

I noticed that BatchNorm2d with affine=True uses learnable parameters per channel (64) rather than per input activation (64x32x32). I guess that's intentional to reduce the number of parameters, but am I missing something?

Number of Input Channels = 64
module.bn1.weight: 0.6789193153381348
module.bn1.weight.size(): torch.Size([64])
module.bn1.weight.Gradients: 2.592428207397461

module.bn1.bias.: 0.7608728408813477
module.bn1.bias.size(): torch.Size([64])
module.bn1.bias.Gradients: 1.4683836698532104

I am using the following ResNet model from -->
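
For reference, the shapes above can be reproduced with a small standalone check; the 16x64x32x32 input below is just an assumed example:

import torch

bn = torch.nn.BatchNorm2d(64, affine=True)
x = torch.randn(16, 64, 32, 32)        # assumed batch of 16 with 64x32x32 activations
out = bn(x)
print(bn.weight.size())                # torch.Size([64]) -- one scale per channel
print(bn.bias.size())                  # torch.Size([64]) -- one shift per channel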

That is intentional. Batch norm seems to mean different things to different people when it comes to the specifics…
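
For 2d inputs the per-channel scale and shift are broadcast over the spatial dimensions, so every position of a channel shares the same pair of parameters. A rough sketch of that broadcasting (illustrative only, not the actual kernel):

import torch

# every spatial position of channel c shares the same weight[c] and bias[c]
x_hat = torch.randn(16, 64, 32, 32)    # pretend this is the already-normalized input
weight = torch.ones(64)                # one learnable scale per channel
bias = torch.zeros(64)                 # one learnable shift per channel
y = weight.view(1, 64, 1, 1) * x_hat + bias.view(1, 64, 1, 1)
print(y.shape)                         # torch.Size([16, 64, 32, 32])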

Best regards

Thomas


Thanks @tom

I have another question: if the learnable parameters are disabled (affine=False), are the running_mean and running_var per channel or per activation? (I guess per channel.)

Per channel. You can tell by the fact that you don’t actually provide dimensions beyond the number of channels to BN, so it is unaware of e.g. the “image” dimensions you feed through it.
If you do

import torch
bn = torch.nn.BatchNorm2d(3, affine=True)
print(bn.state_dict())

you also have proof that it has three-element vectors for all four state items (recent versions also store a scalar num_batches_tracked counter). Part of the beauty of PyTorch is that you can easily poke the modules to see how they behave.
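
On recent PyTorch versions the same check can be made explicit with a quick loop over the state_dict (a sketch, extending the snippet above):

import torch

bn = torch.nn.BatchNorm2d(3, affine=True)
for name, tensor in bn.state_dict().items():
    print(name, tuple(tensor.shape))
# weight (3,), bias (3,), running_mean (3,), running_var (3,), num_batches_tracked ()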

Best regards

Thomas


Thanks a lot @tom !!
I appreciate your explanation.

To compute the per-channel running_mean, does it take the mean over all activations of that channel at once (across all examples and spatial positions), or does it first average across examples and then across activations?
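
For what it's worth, with a fixed spatial size the two orderings give the same per-channel result, which a quick check illustrates (the 8x64x32x32 tensor below is just an assumed example):

import torch

x = torch.randn(8, 64, 32, 32)                 # assumed batch: 8 examples, 64 channels, 32x32 maps

# (a) mean over all activations of each channel at once (examples and spatial positions together)
mean_all = x.mean(dim=(0, 2, 3))               # shape [64]

# (b) first average across examples, then across spatial positions
mean_two_step = x.mean(dim=0).mean(dim=(1, 2)) # shape [64]

print(torch.allclose(mean_all, mean_two_step)) # True -- both orderings agree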