In an attempt to understand how BatchNorm1d works in PyTorch, I tried to match the output of a BatchNorm1d operation on a 2D tensor against a manual normalization. The manual output seems to be scaled down by a factor of 0.9747. Here's the code (note that affine is set to False):
import torch
import torch.nn as nn
from torch.autograd import Variable
X = torch.randn(20,100) * 5 + 10
X = Variable(X)
B = nn.BatchNorm1d(100, affine=False)
y = B(X)
mu = torch.mean(X[:,1])
var_ = torch.var(X[:,1])
sigma = torch.sqrt(var_ + 1e-5)
x = (X[:,1] - mu)/sigma
# the ratio below should be equal to one
print(x.data / y[:,1].data)
Output is:
0.9747
0.9747
0.9747
....
Doing the same thing for BatchNorm2d works without any issues. How does BatchNorm1d calculate its output?
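The answers below point at the biased vs. unbiased variance. To make that concrete, here is a minimal check (my own sketch, not from the thread) showing that BatchNorm1d matches the manual computation once `unbiased=False` is passed to `torch.var`, and that the 0.9747 factor is exactly sqrt((N-1)/N) for a batch of N=20:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(20, 100) * 5 + 10
y = nn.BatchNorm1d(100, affine=False)(X)

mu = X[:, 1].mean()
# biased (population) variance, as BatchNorm uses internally
sigma = torch.sqrt(X[:, 1].var(unbiased=False) + 1e-5)
x = (X[:, 1] - mu) / sigma

print(torch.allclose(x, y[:, 1], atol=1e-4))  # True
print((19 / 20) ** 0.5)  # ≈ 0.9747, the observed ratio
```

With the default `unbiased=True`, `torch.var` divides by N-1 (Bessel's correction), while BatchNorm normalizes with the variance divided by N, hence the constant ratio of sqrt(19/20).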
Just for clarification: this should apply to BatchNorm2d (and 3d) as well, right? I guess since the effect of Bessel's correction becomes less significant as the number of elements per channel increases, I didn't see any discrepancy with BatchNorm2d.
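That intuition can be checked numerically (a quick sketch, with an assumed 32x32 spatial size for the 2D case): the per-channel sample count for BatchNorm2d is N*H*W rather than N, so sqrt((M-1)/M) is indistinguishable from 1:

```python
# sqrt(biased_var / unbiased_var) = sqrt((M-1)/M) for M samples per channel
for m in (20, 20 * 32 * 32):  # BatchNorm1d on (20, C) vs BatchNorm2d on (20, C, 32, 32)
    print(m, ((m - 1) / m) ** 0.5)
# 20    -> ≈ 0.9747 (clearly visible)
# 20480 -> ≈ 0.99998 (lost in floating-point noise)
```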
Thank you for your answer. A follow-up question: how does BatchNorm1d calculate the variance for 3D data? I noticed that the variance of normalized 3D data is nowhere near 1. Here's what I did:
Three things to note:
1. BN applies an affine transform, so you want to set the affine scaling weight to 1 before computing Y, i.e. B2.weight.data.fill_(1).
2. Use the biased version of the variance.
3. BN normalizes the data within each channel, so you should calculate Y's variance per channel, rather than over all of Y.
Altogether, this gives the correct result:
X3 = torch.randn(150, 20, 100) * 2 + 4
X3 = Variable(X3)
B2 = nn.BatchNorm1d(20)
B2.weight.data.fill_(1)  # neutralize the affine scale
Y = B2(X3)
Y = Y.transpose(0, 1).contiguous().view(20, -1)  # put data for each channel in the second dimension
print(Y.var(dim=-1, unbiased=False))  # biased per-channel variance: ~1 for every channel
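To answer the "how" directly: for 3D input of shape (N, C, L), BatchNorm1d computes the mean and biased variance per channel over dimensions 0 and 2. A sketch of the manual equivalent (assuming a PyTorch version where `mean`/`var` accept a tuple of dims, and using `affine=False` to sidestep the weight issue above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X3 = torch.randn(150, 20, 100) * 2 + 4
bn = nn.BatchNorm1d(20, affine=False)
Y = bn(X3)

# per-channel statistics over the batch (dim 0) and length (dim 2) dimensions
mu = X3.mean(dim=(0, 2), keepdim=True)
var = X3.var(dim=(0, 2), unbiased=False, keepdim=True)
Y_manual = (X3 - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(Y, Y_manual, atol=1e-4))  # True
```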