Inconsistent Batchnorm behavior in eval and training modes

What does it mean if “the running stats are equal”? @alldbi


related: How does one use the mean and std from training in Batch Norm?


It doesn’t really matter what you meant, I suppose; the original paper agrees with ptrblck, as it should, of course:

BN intended behaviour:

  • Importantly, during inference (eval/testing), running_mean and running_std are used (because we want a deterministic output and estimates of the population statistics).
  • During training the batch statistics are used, but population statistics are estimated with running averages (see the sketch after this list). I assume the reason batch statistics are used during training is to introduce noise that regularizes training (noise robustness).
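A quick way to see both behaviours in PyTorch; a minimal sketch, where the feature count, batch size, and input values are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)           # 3 features, default momentum=0.1
x = torch.randn(8, 3) * 5 + 2    # a batch with non-trivial mean/std

bn.train()
y_train = bn(x)                  # normalized with the *batch* mean/var
print(y_train.mean(0), y_train.var(0, unbiased=False))  # ~0 and ~1 per feature
print(bn.running_mean, bn.running_var)                  # running averages got updated

bn.eval()
y_eval = bn(x)                   # normalized with running_mean / running_var
print(y_eval.mean(0))            # not ~0 in general: depends on the running stats
```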

ref: [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically. For this, once the network has been trained, we use the normalization
$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
using the population, rather than mini-batch, statistics. Neglecting $\epsilon$, these normalized activations have the same mean 0 and variance 1 as during training. We use the unbiased variance estimate $\mathrm{Var}[x] = \frac{m}{m-1} \cdot \mathrm{E}_{\mathcal{B}}[\sigma^2_{\mathcal{B}}]$, where the expectation is over training mini-batches of size $m$ and $\sigma^2_{\mathcal{B}}$ are their sample variances. Using moving averages instead, we can track the accuracy of a model as it trains.
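That $\frac{m}{m-1}$ correction is also what PyTorch applies when it updates running_var: the batch is normalized with the biased variance, but the running estimate is updated with the unbiased one. A minimal sketch checking this, assuming the default momentum of 0.1 and the default initialization of running_var to 1:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m = 8
bn = nn.BatchNorm1d(1, momentum=0.1)
x = torch.randn(m, 1)

bn.train()
bn(x)                                          # one training-mode forward pass

biased_var = x.var(0, unbiased=False)          # sigma_B^2 used to normalize the batch
unbiased_var = biased_var * m / (m - 1)        # the paper's m/(m-1) correction
expected_running_var = (1 - 0.1) * 1.0 + 0.1 * unbiased_var  # running_var starts at 1
print(bn.running_var, expected_running_var)    # should agree up to float error
```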
