What does batchnorm's output depend on?

I have two batchnorm layers, both with the same weight, bias, running_mean, and running_var. The first is a BatchNorm2d layer and the second is a BatchNorm3d layer.

I give an input x2d (1x128x64x64) to the first BatchNorm2d layer and an input x3d (1x128x5x64x64) to the second BatchNorm3d layer. I ensure that

x3d[:,:,0,:,:] = x3d[:,:,1,:,:] = x3d[:,:,2,:,:] = x3d[:,:,3,:,:] = x3d[:,:,4,:,:]  = x2d

Thus they get the same per-channel mean, i.e.:

(x2d.mean(dim=-1).mean(dim=-1)) == (x3d.mean(dim=-1).mean(dim=-1).mean(dim=-1))
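As a quick sanity check (a sketch with random data standing in for my actual tensors), stacking identical copies along a new depth dimension does leave the per-channel mean unchanged:

```python
import torch

# Random stand-in for x2d; x3d holds five identical copies along a new
# depth dimension, so the per-channel means should agree.
x2d = torch.randn(1, 128, 64, 64)
x3d = x2d.unsqueeze(2).expand(1, 128, 5, 64, 64)

m2 = x2d.mean(dim=(0, 2, 3))
m3 = x3d.mean(dim=(0, 2, 3, 4))
print(torch.allclose(m2, m3, atol=1e-5))  # True
```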

I expected both batchnorm layers to give exactly the same output, but it turns out they are off by a few percent!
Any reason for this? @ptrblck?

How large is the difference?
If it’s approx. 1e-6, it could be due to the order of operations applied to the data.
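For example (a minimal sketch, not tied to your tensors): floating-point addition is not associative, so reducing the same values in a different order can already produce differences on that scale:

```python
import torch

# Summing the same values in a different order: floating-point addition
# is not associative, so the two sums can differ slightly.
x = torch.rand(100000)
s_forward = x.sum()
s_reversed = x.flip(0).sum()
print((s_forward - s_reversed).abs().item())  # tiny, but often not exactly 0
```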

PS: I don’t think it’s good idea to ping people directly, since it could discourage others to answer in this thread. :wink:


It was about 2-3% on average, sometimes even 10%.
And repeated batchnorms of this kind probably compound into very different results. Any tip on how to correct this? If all parameters and everything else are the same, I should expect the exact same answer!

And sorry for the ping :sweat_smile: , I was really frustrated trying to figure out the reason. Won't repeat it :stuck_out_tongue: !

Just in case you want to check:

Here is the data over which I want to perform batch norm, here is the 2D batchnorm layer I am using, and here is the 3D batchnorm layer I am using.

bn2d = torch.load('bn2d.pth', map_location=lambda storage, loc: storage)
bn3d = torch.load('bn3d.pth', map_location=lambda storage, loc: storage)
data = torch.load('2D-data.pth', map_location=lambda storage, loc: storage)

> o2 = bn2d(data)
> o2
# Prints a large tensor

> o3 = bn3d(data[:,:,None,:,:].expand(1,64,32,128,128))[:,:,0,:,:]
> o3
# Prints another large tensor with the same shape as the one above

> o2.norm()
# tensor(156.6110)
> o3.norm()
# tensor(157.0714)

> (o3*o2).sum()/(o2.norm() * o3.norm())

0.9985 is the cosine of the angle between the two vectors (flattened matrices). It's not terrible, but I'd love for it to be exactly 1!
I am debugging a network that runs on static-image video, so I really need it!
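(For reference, that quantity is the cosine similarity. A sketch with random stand-ins for o2 and o3, since the saved tensors aren't available here, showing it matches `torch.nn.functional.cosine_similarity`:)

```python
import torch
import torch.nn.functional as F

# Random stand-ins for o2 / o3 (the actual saved tensors aren't available here).
o2 = torch.randn(1, 64, 128, 128)
o3 = o2 + 0.05 * torch.randn_like(o2)

manual = (o3 * o2).sum() / (o2.norm() * o3.norm())
builtin = F.cosine_similarity(o2.flatten(), o3.flatten(), dim=0)
print(manual.item(), builtin.item())  # both close to 1, and equal to each other
```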

Thanks for the code! I created a small example script to show the difference between the 2D and 3D case.
I think it’s due to the different variance, since you have to divide by a larger number of elements in the 3D case.

The code is a bit ugly, but I think it will explain the effects:

import torch
import torch.nn as nn

# Init BatchNorm2d
bn2 = nn.BatchNorm2d(3)

# Create 2D tensor with shape (1, 3, 2, 2): batch=1, channels=3
tmp = torch.cat((torch.ones(1, 2, 1), torch.ones(1, 2, 1) * 2), 2)
x2 = torch.cat((torch.zeros(1, 2, 2), tmp, tmp * 2), 0).unsqueeze(0)

# Calculate stats
x2_mean = x2.mean(-1).mean(-1)
num_elem2 = 4
x2_var_unbiased = ((x2 - x2_mean.view(1, 3, 1, 1))**2).sum(2).sum(2) / (num_elem2 - 1)
print('x2: ', x2)
print('x2 mean: ', x2_mean)
print('x2 var_unbiased: ', x2_var_unbiased)
print('bn2 running_mean: ', bn2.running_mean)
print('bn2 running_var: ', bn2.running_var)
print('Expected bn2 running_mean after forward pass: ', 
     bn2.running_mean * (1 - bn2.momentum) + x2_mean * bn2.momentum)
print('Expected bn2 running_var after forward pass: ', 
     bn2.running_var * (1 - bn2.momentum) + x2_var_unbiased * bn2.momentum)

# Perform forward pass on 2D data
output2 = bn2(x2)
print('output2: ', output2)
print('bn2 running mean after forward pass:', bn2.running_mean)
print('bn2 running var after forward pass:', bn2.running_var)

# Init BatchNorm3d
bn3 = nn.BatchNorm3d(3)

# Create 3D tensor from 2D
x3 = x2.unsqueeze(2).repeat(1, 1, 5, 1, 1)

# Calculate stats
x3_mean = x3.mean(-1).mean(-1).mean(-1)
num_elem3 = 5 * 4
x3_var_unbiased = ((x3 - x3_mean.view(1, 3, 1, 1, 1))**2).sum(2).sum(2).sum(2) / (num_elem3 - 1)
print('x3: ', x3)
print('x3 mean: ', x3_mean)
print('x3 var_unbiased: ', x3_var_unbiased)
print('bn3 running_mean: ', bn3.running_mean)
print('bn3 running_var: ', bn3.running_var)
print('Expected bn3 running_mean after forward pass: ', 
     bn3.running_mean * (1 - bn3.momentum) + x3_mean * bn3.momentum)
print('Expected bn3 running_var after forward pass: ', 
     bn3.running_var * (1 - bn3.momentum) + x3_var_unbiased * bn3.momentum)

# Perform forward pass on 3D data
output3 = bn3(x3)
#print('output3: ', output3)
print('bn3 running mean after forward pass:', bn3.running_mean)
print('bn3 running var after forward pass:', bn3.running_var)
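In other words, the effect boils down to Bessel's correction: repeating the data five times leaves the biased variance unchanged, but the unbiased variance divides by N - 1, and N differs between the two cases (4 vs. 20). A minimal check:

```python
import torch

# Repeating a tensor 5x leaves the biased variance unchanged, but the
# unbiased variance changes: it scales the biased one by 4/3 for N=4
# versus 20/19 for N=20.
x = torch.randn(4)
x_rep = x.repeat(5)

print(torch.var(x, unbiased=False).item(), torch.var(x_rep, unbiased=False).item())  # equal
print(torch.var(x, unbiased=True).item(), torch.var(x_rep, unbiased=True).item())    # different
```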

Let me know if this clears things up! :slight_smile:


Hey, firstly thanks for such a detailed explanation, much respect :slight_smile: . Seriously, the PyTorch forums are great :heart: !!

In BatchNorm3d we divide by a larger number of elements, but the numerator has a larger number of terms as well.

But I was simply assuming the variance used by batchnorm is the biased variance (which divides by num_elements), in which case my reasoning would have worked:

x2_var_biased = ((x2 - x2_mean.view(1, 3, 1, 1))**2).sum(2).sum(2) / num_elem2

print('Expected bn2 running_var after forward pass: ', 
     bn2.running_var * (1 - bn2.momentum) + x2_var_biased * bn2.momentum)
# Expected bn2 running_var after forward pass:  tensor([[ 0.9000,  0.9250,  1.0000]])

x3_var_biased = ((x3 - x3_mean.view(1, 3, 1, 1, 1))**2).sum(2).sum(2).sum(2) / num_elem3

print('Expected bn3 running_var after forward pass: ', 
     bn3.running_var * (1 - bn3.momentum) + x3_var_biased * bn3.momentum)
# Expected bn3 running_var after forward pass:  tensor([[ 0.9000,  0.9250,  1.0000]])

(after I dropped the subtraction of 1 from num_elements in both variance computations)

But with the unbiased estimates, the scaling doesn't carry over because of that extra -1: the 2D case divides by 4 - 1 = 3 while the 3D case divides by 20 - 1 = 19, so the running variances end up different!

Thanks for pointing that unbiased thing out, I wouldn’t have thought about it!!

_Also, is there any way to use batchnorm with biased variances (without Bessel's correction), or will I have to build it myself? I couldn't find it in the documentation!_
(Starting a new thread for this new question and closing this one.)
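For reference, a minimal sketch of such a layer (the class name `BiasedBatchNorm2d` is hypothetical, not an official PyTorch API), assuming you normalize with batch statistics in training mode and only change how running_var is updated:

```python
import torch
import torch.nn as nn

class BiasedBatchNorm2d(nn.Module):
    """Sketch of a BatchNorm2d that updates running_var with the biased
    (divide-by-N) batch variance instead of the unbiased one."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)  # biased variance
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        else:
            mean, var = self.running_mean, self.running_var
        # Normalize, then apply the affine transform
        x_hat = (x - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```

With this, repeating the input along an extra dimension would update running_var identically in the 2D and 3D cases, since the biased variance does not depend on N in that way.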
