As I understand batchnorm, its output should be the same in eval and training modes if the running stats are equal. However, I get different outputs even when the stats match. For instance, consider the following toy example, in which two identical batchnorm modules produce different outputs although their running stats are exactly the same.
import copy
import torch
import torch.nn as nn
bn1 = nn.BatchNorm2d(3)
bn2 = copy.deepcopy(bn1)
# create dummy input
img = torch.rand(size=[4, 3, 32, 32])
# set the first bn in training mode
bn1.train()
# compute a dummy output
pred1 = bn1(img).sum()
print('bn1 running mean:', bn1.running_mean)
print('bn1 running var:', bn1.running_var)
print('output1:', pred1.item(), '\n')
# set the second bn in training mode
bn2.train()
# update batchnorm stats
_ = bn2(img)
print('bn2 running mean:', bn2.running_mean)
print('bn2 running var:', bn2.running_var)
# set the bn to eval mode
bn2.eval()
pred2 = bn2(img).sum()
print('output2:', pred2.item())
No, that’s not true.
During training, the input activation will be normalized using the batch statistics (so the stats calculated from the input activation itself) and the running stats will be updated using the momentum, the current batch stats, and the old running stats.
During evaluation, the input activation will be normalized using the running stats.
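To make the difference concrete, here is a minimal sketch that recomputes both normalizations by hand and compares them with the module's output. It assumes a freshly constructed `nn.BatchNorm2d` (so the affine weight is 1 and the bias is 0); the variable names are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
img = torch.rand(4, 3, 32, 32)

# Training mode: normalize with the statistics of the current batch
# (per-channel mean and biased variance over N, H, W).
bn.train()
out_train = bn(img)
batch_mean = img.mean(dim=(0, 2, 3), keepdim=True)
batch_var = img.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual_train = (img - batch_mean) / torch.sqrt(batch_var + bn.eps)
train_matches = torch.allclose(out_train, manual_train, atol=1e-4)

# Eval mode: normalize with the running statistics stored in the module.
bn.eval()
out_eval = bn(img)
run_mean = bn.running_mean.view(1, -1, 1, 1)
run_var = bn.running_var.view(1, -1, 1, 1)
manual_eval = (img - run_mean) / torch.sqrt(run_var + bn.eps)
eval_matches = torch.allclose(out_eval, manual_eval, atol=1e-4)

# The two modes disagree on the same input, even though the running
# stats were updated by that very input.
modes_differ = not torch.allclose(out_train, out_eval, atol=1e-3)
print(train_matches, eval_matches, modes_differ)
```

After a single training step the running stats are still far from the batch stats, so the two outputs differ even though the module and the input are the same.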
Regardless of what you meant, the original paper agrees with ptrblck, as it should:
BN intended behaviour:
Importantly, during inference (eval/testing), running_mean and running_var are used, because the output should be deterministic and based on estimates of the population statistics.
During training, the batch statistics are used, while the population statistics are estimated with running averages. I assume batch statistics are used during training because they introduce noise that regularizes training (noise robustness).
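The running-average update can be checked with a small sketch. It assumes the default `momentum=0.1` of `nn.BatchNorm2d` and that `running_var` is updated with the unbiased batch variance, which is the behaviour I believe PyTorch implements; `expected_mean` and `expected_var` are names made up for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)               # default momentum is 0.1
img = torch.rand(4, 3, 32, 32)

old_mean = bn.running_mean.clone()   # initialised to zeros
old_var = bn.running_var.clone()     # initialised to ones

# One forward pass in training mode updates the running stats.
bn.train()
_ = bn(img)

# Per-channel batch statistics over N, H, W.
batch_mean = img.mean(dim=(0, 2, 3))
batch_var = img.var(dim=(0, 2, 3), unbiased=True)  # assumption: unbiased estimate

# Exponential moving average: new = (1 - momentum) * old + momentum * batch
m = bn.momentum
expected_mean = (1 - m) * old_mean + m * batch_mean
expected_var = (1 - m) * old_var + m * batch_var

mean_ok = torch.allclose(bn.running_mean, expected_mean, atol=1e-5)
var_ok = torch.allclose(bn.running_var, expected_var, atol=1e-5)
print(mean_ok, var_ok)
```

With momentum 0.1, one step moves the running stats only 10% of the way toward the batch stats, which is why they do not equal the batch stats after a single forward pass.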
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization
$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
using the population, rather than mini-batch, statistics. Neglecting $\epsilon$, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $\mathrm{Var}[x] = \frac{m}{m-1} \cdot \mathrm{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$, where the expectation is over training mini-batches of size $m$ and $\sigma_{\mathcal{B}}^2$ are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains.
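The m/(m−1) factor in the unbiased variance estimate from the quoted passage can be illustrated numerically. This is only a sketch: it treats m as the number of values per channel (batch size times height times width), which is my assumption for a 2D batchnorm layer, and it uses a single batch rather than an expectation over many.

```python
import torch

torch.manual_seed(0)
img = torch.rand(4, 3, 32, 32)

# Number of values that contribute to each channel's statistics.
m = img.numel() // img.size(1)       # 4 * 32 * 32 = 4096

# Biased (divide by m) and unbiased (divide by m - 1) per-channel variances.
biased = img.var(dim=(0, 2, 3), unbiased=False)
unbiased = img.var(dim=(0, 2, 3), unbiased=True)

# Rescaling the biased variance by m / (m - 1) gives the unbiased one.
rescaled = biased * m / (m - 1)
correction_ok = torch.allclose(rescaled, unbiased, atol=1e-6)
print(m, correction_ok)
```

For m this large the correction is tiny, so it does not explain the mismatch in the original question; that comes from using batch stats versus running stats.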
Thank you, @ptrblck and @Brando_Miranda. I was missing the fact that during training, feature maps are normalized using their own batch statistics, not the moving averages.