# Batchnorm train with momentum=1 then eval, why are they different?

My question is: why are the output of a batchnorm training step with momentum=1 and the output of a subsequent eval step on the same input not identical?

```python
import numpy as np
import torch
from torch.autograd import Variable


class BNNet(torch.nn.Module):
    """
    Module with a single BN layer
    """
    def __init__(self):
        super(BNNet, self).__init__()
        # momentum=1.0: running stats are fully replaced by the latest batch stats
        self.bn = torch.nn.BatchNorm2d(2, momentum=1.0)

    def forward(self, x):
        return self.bn(x)


def example():
    model = BNNet()
    model.train()
    inp = Variable(10 * torch.randn(1, 2, 10, 10))
    # output after train forward pass
    output1 = model(inp)
    model.eval()
    # output after eval forward pass
    output2 = model(inp)
    # the two outputs differ; the magnitude of their difference decreases as
    # the window size (10x10 in this example) increases
    print("max abs diff = %.4f" % (np.max(np.abs(output1.detach().numpy()
                                                 - output2.detach().numpy()))))
```

Because BN uses Bessel's correction on the variance at eval() time but not at training time: the train-mode forward normalizes with the biased (divide by n) batch variance, while the `running_var` it stores, and later uses in eval mode, is the unbiased (divide by n - 1) estimate.
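To see both variances side by side, here is a minimal sketch (not from the original thread, using the same 1x2x10x10 input shape as the question):

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm2d(2, momentum=1.0)
bn.train()
x = 10 * torch.randn(1, 2, 10, 10)
bn(x)  # one train-mode forward pass; momentum=1.0 overwrites the running stats

# running_var stores the Bessel-corrected (divide by n-1) per-channel variance
unbiased = x.var(dim=(0, 2, 3), unbiased=True)
print(torch.allclose(bn.running_var, unbiased))  # True

# ...but the train-mode normalization itself used the biased (divide by n) variance
biased = x.var(dim=(0, 2, 3), unbiased=False)
print(torch.allclose(bn.running_var * (99 / 100.), biased))  # True, n = 100 here
```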

Ref: paper https://arxiv.org/pdf/1502.03167.pdf

If you undo the correction in eval, i.e. `model.bn.running_var.mul_(99 / 100.)` (here n = 1 * 10 * 10 = 100 elements per channel), it should give you the same results. It also explains why the difference is smaller when you have more elements, because `n / (n-1)` becomes closer to 1.
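Continuing `example()` from the question (a hypothetical continuation reusing `model`, `inp`, and `output1` from that snippet):

```python
# n = batch * H * W = 1 * 10 * 10 = 100 elements per channel
model.bn.running_var.mul_(99 / 100.)  # undo Bessel's correction: var * (n-1)/n
model.eval()
output3 = model(inp)
# the eval output now matches the train output up to float precision
print("max abs diff = %.4f" % np.max(np.abs(output1.detach().numpy()
                                            - output3.detach().numpy())))
```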


Thank you
Indeed, scaling by `(window size - 1) / (window size)` confirms this.

As for the correctness of using Bessel's correction on an activation map where the values are definitely not independent, that's another story. I guess the "correct scaling" lies somewhere in

(batch size - 1) / (batch size) < "correct scaling" < (batch size * window size - 1) / (batch size * window size).
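Plugging in the numbers from the example (batch size 1, 10x10 window) just to see how wide that interval is, a toy calculation on my part:

```python
batch, window = 1, 10 * 10
lower = (batch - 1) / batch                      # 0.0 (degenerate for batch size 1)
upper = (batch * window - 1) / (batch * window)  # 0.99, the factor used above
print(lower, upper)
```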