# Batchnorm train with momentum=1 then eval, why are they different?

My question is: why are the output of a batchnorm training step with momentum=1 and the output of a subsequent eval step on the same input not identical?

```python
import numpy as np
import torch
from torch.autograd import Variable


class BNNet(torch.nn.Module):
    """
    Module with a single BN layer
    """
    def __init__(self):
        super(BNNet, self).__init__()
        # momentum=1.0: running stats are fully replaced by the latest batch stats
        self.bn = torch.nn.BatchNorm2d(2, momentum=1.0)

    def forward(self, x):
        return self.bn(x)


def example():
    model = BNNet()
    model.train()
    inp = Variable(10 * torch.randn(1, 2, 10, 10))
    # output after train forward pass
    output1 = model(inp)
    model.eval()
    # output after eval forward pass
    output2 = model(inp)
    # the two outputs differ; the magnitude of their difference decreases as
    # the window size (10x10 in this example) increases
    print("max abs diff = %.4f" % (np.max(np.abs(output1.detach().numpy()
                                                 - output2.detach().numpy()))))
```

Because BN uses Bessel's correction on the variance at eval() time but not at training time: the train-mode forward normalizes with the biased (divide by n) batch variance, while the `running_var` it stores, and later uses in eval mode, is the unbiased (divide by n - 1) estimate.
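To see both variances side by side, here is a minimal sketch (not from the original thread, using the same 1x2x10x10 input shape as the question):

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm2d(2, momentum=1.0)
bn.train()
x = 10 * torch.randn(1, 2, 10, 10)
bn(x)  # one train-mode forward pass; momentum=1.0 overwrites the running stats

# running_var stores the Bessel-corrected (divide by n-1) per-channel variance
unbiased = x.var(dim=(0, 2, 3), unbiased=True)
print(torch.allclose(bn.running_var, unbiased))  # True

# ...but the train-mode normalization itself used the biased (divide by n) variance
biased = x.var(dim=(0, 2, 3), unbiased=False)
print(torch.allclose(bn.running_var * (99 / 100.), biased))  # True, n = 100 here
```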

Ref: paper https://arxiv.org/pdf/1502.03167.pdf

If you undo the correction in eval, i.e. `model.bn.running_var.mul_(99 / 100.)` (here n = 1 * 10 * 10 = 100 elements per channel), it should give you the same results. It also explains why the difference is smaller when you have more elements, because `n / (n-1)` becomes closer to 1.
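Continuing `example()` from the question (a hypothetical continuation reusing `model`, `inp`, and `output1` from that snippet):

```python
# n = batch * H * W = 1 * 10 * 10 = 100 elements per channel
model.bn.running_var.mul_(99 / 100.)  # undo Bessel's correction: var * (n-1)/n
model.eval()
output3 = model(inp)
# the eval output now matches the train output up to float precision
print("max abs diff = %.4f" % np.max(np.abs(output1.detach().numpy()
                                            - output3.detach().numpy())))
```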


Thank you
Indeed, scaling by `(window size - 1) / (window size)` confirms this.

As for the correctness of using Bessel's correction on an activation map where the values are definitely not independent, that's another story. I guess the "correct scaling" lies somewhere in

(batch size - 1) / (batch size) < "correct scaling" < (batch size * window size - 1) / (batch size * window size).
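Plugging in the numbers from the example (batch size 1, 10x10 window) just to see how wide that interval is, a toy calculation on my part:

```python
batch, window = 1, 10 * 10
lower = (batch - 1) / batch                      # 0.0 (degenerate for batch size 1)
upper = (batch * window - 1) / (batch * window)  # 0.99, the factor used above
print(lower, upper)
```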