Inconsistent Batchnorm behavior in eval and training modes


According to the expected behavior of batchnorm, its output should be the same in eval and training modes if the running stats are equal. However, I do not get consistent outputs when the stats are the same. For instance, consider the following toy example, in which the outputs of two identical batchnorm modules differ even though their running stats are exactly the same.

import copy
import torch
import torch.nn as nn

bn1 = nn.BatchNorm2d(3)
bn2 = copy.deepcopy(bn1)

# create dummy input
img = torch.rand(size=[4, 3, 32, 32])

# bn1 is in training mode by default
# compute a dummy output
pred1 = bn1(img).sum()

print('bn1 running mean:', bn1.running_mean)
print('bn1 running var:', bn1.running_var)
print('output1:', pred1.item(), '\n')

# bn2 is also in training mode by default;
# this forward pass updates its running stats
_ = bn2(img)

print('bn2 running mean:', bn2.running_mean)
print('bn2 running var:', bn2.running_var)

# set the second bn to eval mode
bn2.eval()
pred2 = bn2(img).sum()
print('output2:', pred2.item())

Example output is:

bn1 running mean: tensor([0.0499, 0.0501, 0.0499])
bn1 running var: tensor([0.9081, 0.9082, 0.9083])
output1: -0.00052642822265625 

bn2 running mean: tensor([0.0499, 0.0501, 0.0499])
bn2 running var: tensor([0.9081, 0.9082, 0.9083])
output2: 5794.7978515625

I expected to get equal outputs, but they are drastically different. Can anyone help me understand what is causing the difference in the outputs?

No, that’s not true.
During training, the input activation is normalized using the batch statistics (i.e. the statistics computed from the input activation itself), and the running stats are updated using the momentum, the current batch stats, and the old running stats.
During evaluation, the input activation is normalized using the running stats.
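A small sketch of that update rule, assuming the default momentum of 0.1: the new running stat is (1 − momentum) · old + momentum · batch_stat, with the unbiased variance tracked for running_var.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)  # training mode by default; momentum defaults to 0.1
x = torch.rand(4, 3, 8, 8)

# per-channel batch statistics over the N, H, W dimensions
batch_mean = x.mean(dim=(0, 2, 3))
batch_var = x.var(dim=(0, 2, 3), unbiased=True)  # running_var tracks the unbiased estimate

old_mean = bn.running_mean.clone()
old_var = bn.running_var.clone()

_ = bn(x)  # a training-mode forward pass updates the running stats

expected_mean = (1 - bn.momentum) * old_mean + bn.momentum * batch_mean
expected_var = (1 - bn.momentum) * old_var + bn.momentum * batch_var

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-5))  # True
print(torch.allclose(bn.running_var, expected_var, atol=1e-5))    # True
```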


What does it mean if “the running stats are equal”? @alldbi

related: How does one use the mean and std from training in Batch Norm?

Whatever you mean by that, I suppose, the original paper agrees with ptrblck, as it should of course:

BN intended behaviour:

  • Importantly, during inference (eval/testing) running_mean and running_var are used (because they want a deterministic output and to use estimates of the population statistics).
  • During training the batch statistics are used, but population statistics are estimated with running averages. I assume the reason batch stats are used during training is to introduce noise that regularizes training (noise robustness).
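To make the first point concrete, here is a quick check that an eval-mode forward pass computes exactly (x − running_mean) / √(running_var + eps) · weight + bias, using the module's public attributes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3).eval()  # eval mode: normalize with running stats
x = torch.rand(2, 3, 4, 4)

out = bn(x)

# manual normalization with the running stats (broadcast over N, H, W)
mean = bn.running_mean.view(1, -1, 1, 1)
var = bn.running_var.view(1, -1, 1, 1)
w = bn.weight.view(1, -1, 1, 1)
b = bn.bias.view(1, -1, 1, 1)
manual = (x - mean) / torch.sqrt(var + bn.eps) * w + b

print(torch.allclose(out, manual, atol=1e-6))  # True
```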

ref: [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization

x̂ = (x − E[x]) / √(Var[x] + ε)

using the population, rather than mini-batch, statistics. Neglecting ε, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate Var[x] = m/(m − 1) · E_B[σ_B²], where the expectation is over training mini-batches of size m and σ_B² are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains.


Thank you, @ptrblck and @Brando_Miranda. I was missing the fact that during training, feature maps are normalized using their own batch statistics, not the moving averages.
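For completeness, a quick sketch showing that once both modules are in eval mode (so both normalize with the same running stats), the outputs do match as originally expected:

```python
import copy
import torch
import torch.nn as nn

bn1 = nn.BatchNorm2d(3)
bn2 = copy.deepcopy(bn1)
img = torch.rand(4, 3, 32, 32)

# one training-mode pass each, so both end up with identical running stats
_ = bn1(img)
_ = bn2(img)

# now both normalize with the running stats instead of the batch stats
bn1.eval()
bn2.eval()

out1 = bn1(img)
out2 = bn2(img)
print(torch.allclose(out1, out2))  # True
```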