BatchNorm1d vs Standardisation

As far as I understand, BatchNorm1d will do standardisation if we set affine=False, according to a related discussion: What is standard scale of BatchNorm1d?
and the implementation of Barlow Twins: github

However, I found there are some differences between BatchNorm1d and standardisation in a toy example:

import torch
import torch.nn as nn

x = torch.randn(3, 2)
'''
tensor([[-0.3176, -1.0842],
        [-1.1290, -0.0545],
        [-1.7591, -0.5150]])
'''
bn = nn.BatchNorm1d(2, affine=False)
y = bn(x)
'''
tensor([[ 1.2727, -1.2655],
        [-0.1024,  1.1794],
        [-1.1703,  0.0860]])
'''

m,std = x.mean(0), x.std(0)
'''
m:
tensor([-1.0686, -0.5512])
std:
tensor([0.7227, 0.5158])
'''
(x-m)/std
'''
tensor([[ 1.0392, -1.0333],
        [-0.0836,  0.9630],
        [-0.9556,  0.0702]])
'''
bn.state_dict()
'''
OrderedDict([('running_mean', tensor([-0.1069, -0.0551])),
             ('running_var', tensor([0.9522, 0.9266])),
             ('num_batches_tracked', tensor(1))])
'''
var = x.var(0)
m,std,var
'''
tensor([-1.0686, -0.5512])
tensor([0.7227, 0.5158])
tensor([0.5222, 0.2660])
'''

We can see from the above toy example that the mean calculated with torch.mean() is roughly ten times larger in magnitude than the running_mean stored by BatchNorm1d, and the variance computed with torch.var() is also very different from running_var.

So I am wondering why such a difference exists?

You can refer to this code showing how BatchNorm layers apply the normalization.
In your example the var calculation is wrong and you are also comparing the running_* stats to the actual batch stats, which is also wrong since your BatchNorm1d layer is in training mode.
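As a minimal, self-contained sketch of this (assuming the default momentum=0.1 and eps=1e-5; the variable names are my own), you can reproduce both the layer output and the running_* buffers by hand: in training mode the output is normalized with the biased batch variance, while the running buffers are an exponential moving average (starting from 0 and 1) that mixes in the unbiased batch variance, which is why running_mean ends up at roughly 0.1 times the batch mean after a single batch.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(3, 2)
bn = nn.BatchNorm1d(2, affine=False)   # defaults: momentum=0.1, eps=1e-5
y = bn(x)                              # training mode -> uses batch statistics

# normalization uses the *biased* batch variance
mean = x.mean(0)
var_biased = x.var(0, unbiased=False)
y_manual = (x - mean) / torch.sqrt(var_biased + bn.eps)
print(torch.allclose(y, y_manual, atol=1e-6))            # True

# running stats are an exponential moving average that uses the *unbiased* variance
mom = bn.momentum                                        # 0.1 by default
running_mean = (1 - mom) * torch.zeros(2) + mom * mean
running_var = (1 - mom) * torch.ones(2) + mom * x.var(0, unbiased=True)
print(torch.allclose(bn.running_mean, running_mean))     # True
print(torch.allclose(bn.running_var, running_var))       # True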


Hi @ptrblck, thanks for your reply. I have verified that the difference between BatchNorm1d and mean-variance normalisation is due to my calculation of the standard deviation. After using std = x.std(0, unbiased=False), I get the same result as with BatchNorm1d. But I am wondering why unbiased=False is used in BatchNorm1d. Doesn't it make more sense to use unbiased=True, since we are using the sample standard deviation rather than that of the population? I also found that section 3.1 of the batchnorm paper states 'We use the unbiased variance estimate … where the expectation is over training mini-batches of size m and are their sample variances.'
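For reference, a check along these lines (reusing x and bn from the toy example above; the small tolerance accounts for eps) shows the match, and also where the unbiased estimate does get used:

y_train = bn(x)                                            # training mode output
y_biased = (x - x.mean(0)) / x.std(0, unbiased=False)
y_unbiased = (x - x.mean(0)) / x.std(0, unbiased=True)
print(torch.allclose(y_train, y_biased, atol=1e-3))        # True (gap only from eps)
print(torch.allclose(y_train, y_unbiased, atol=1e-3))      # False

# the unbiased estimate only feeds the running_var buffer, which is
# what BatchNorm1d divides by once it is switched to eval() mode
bn.eval()
y_eval = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
print(torch.allclose(bn(x), y_eval, atol=1e-6))            # True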