As far as I understand, BatchNorm1d will do standardisation if we set affine=False, according to a related discussion (What is standard scale of BatchNorm1d?) and the implementation of Barlow Twins: github.
However, I found some differences between BatchNorm1d and manual standardisation in a toy example:
import torch
import torch.nn as nn

x = torch.randn(3, 2)
'''
tensor([[-0.3176, -1.0842],
[-1.1290, -0.0545],
[-1.7591, -0.5150]])
'''
bn = nn.BatchNorm1d(2, affine=False)
y = bn(x)
'''
tensor([[ 1.2727, -1.2655],
[-0.1024, 1.1794],
[-1.1703, 0.0860]])
'''
m,std = x.mean(0), x.std(0)
'''
m:
tensor([-1.0686, -0.5512])
std:
tensor([0.7227, 0.5158])
'''
(x-m)/std
'''
tensor([[ 1.0392, -1.0333],
[-0.0836, 0.9630],
[-0.9556, 0.0702]])
'''
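Just to confirm programmatically that the two results really differ (this check is my own addition, not from the linked code):

torch.allclose(y, (x - m) / std)
'''
False
'''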
bn.state_dict()
'''
OrderedDict([('running_mean', tensor([-0.1069, -0.0551])),
('running_var', tensor([0.9522, 0.9266])),
('num_batches_tracked', tensor(1))])
'''
var = x.var(0)
m,std,var
'''
tensor([-1.0686, -0.5512])
tensor([0.7227, 0.5158])
tensor([0.5222, 0.2660])
'''
We can see from the toy example above that the batch mean calculated with x.mean(0) is 10 times larger than the running_mean stored by BatchNorm1d, while the variances are very different. On top of that, the output of BatchNorm1d does not match the manually standardised (x - m) / std.
So I am wondering why these differences exist?
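For what it's worth, the running_mean stored by BatchNorm1d looks like it is exactly one tenth of the batch mean I computed by hand (my guess is that this is related to the default momentum=0.1 of BatchNorm1d, but that alone does not seem to explain the variance or the normalised output):

bn.running_mean / m
'''
tensor([0.1000, 0.1000])
'''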