How accurate are the running mean and running std in batch normalization?

Hello.

I was wondering how accurate the running average and running std are that a lot of people use (including PyTorch's batch norm functions).

I understand that for each batch the running average (r_avg) of the mean is computed as:

r_avg = 0.1 * r_avg + 0.9 * batch_mean

where batch_mean is the actual mean of the batch.
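
For concreteness, here is a minimal sketch of that update (illustrative code, not PyTorch's internals; note that PyTorch's BatchNorm layers use the convention running = (1 - momentum) * running + momentum * batch_stat with a default momentum of 0.1, so the small weight sits on the batch statistic):

```python
import torch

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    # Per-feature statistics of the current mini-batch.
    batch_mean = batch.mean(dim=0)
    batch_var = batch.var(dim=0, unbiased=True)
    # Exponential moving average of the batch statistics.
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var
```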

How well does this estimator approximate the true mean? By the true mean I mean the estimator computed over all the training data:

avg = sum(all_data_values) / len(all_data_values)

Note that if I stop training, compute the mean of each training batch, mu_1, mu_2, … (for batch 1, batch 2, …), and then take the mean of those means, that is not in general the true mean, i.e. (if N is the number of batches):

(mu_1 + mu_2 + … + mu_N) / N ≠ avg

I know this because I once had to implement the mean of a huge dataset that could not be kept in memory.
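
For reference, a minimal sketch of such a streaming mean (illustrative code): accumulating a sum and a count gives every sample equal weight, so the result matches the true mean even when batch sizes differ, unlike averaging the per-batch means.

```python
import numpy as np

def exact_streaming_mean(batches):
    # Exact mean over data that does not fit in memory, one batch at a time.
    total = 0.0
    count = 0
    for batch in batches:   # each batch: a numpy array of values
        total += batch.sum()
        count += batch.size
    return total / count
```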

Viewing this estimator from a signal processing perspective:

r_avg = 0.1 * r_avg + 0.9 * batch_mean

It is nothing more than a low-pass filter of the mean; that is, assuming our mean is noisy, this kind of estimator gives a more stable mean (we can view the momentum in SGD the same way: it basically reduces the oscillations in the gradients). This makes me think that this is the actual reason, but I do not see whether it is an appropriate estimator for checking that training is going well (when looking at hyperparameters). I could also use the mean of my validation set.
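
As a toy illustration of the low-pass view (made-up numbers, not a real training run): the exponential average ends up close to the underlying mean while damping most of the per-batch noise.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_means = 2.0 + 0.5 * rng.standard_normal(1000)  # noisy per-batch means around 2.0

r_avg = batch_means[0]
momentum = 0.1
for m in batch_means[1:]:
    # same exponential update, acting as a first-order low-pass filter
    r_avg = (1 - momentum) * r_avg + momentum * m

print(r_avg)  # close to 2.0, with far less variance than a single batch mean
```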

Thanks.

It’s not the true mean. But calculating the true mean is infeasible for larger networks and datasets.

I guess you can describe it as a filter with geometric weights, but that doesn’t really contribute to understanding BN. In the end, it is just a running average, because it is expensive to compute the population average. You can compute the population average, and people have tried that. From experiments, the consensus seems to be that there is no significant gain.
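
If someone wants to try the population-average experiment in PyTorch, one rough sketch (assuming a standard DataLoader yielding (inputs, targets) pairs) is to reset the BN buffers and run one extra epoch of forward passes; with momentum=None, BatchNorm switches to a cumulative, equal-weight moving average of the batch statistics:

```python
import torch

@torch.no_grad()
def recompute_bn_stats(model, loader, device="cpu"):
    # Reset every BN layer and switch it to a cumulative moving average.
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => average all batches with equal weight
    model.train()              # running stats are only updated in train mode
    for inputs, _ in loader:
        model(inputs.to(device))
    model.eval()
```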

Using anything from val for training defeats the purpose of val.

Thanks for your reply.

First, I know it is not the true mean; that is why I ask how well that estimator approximates the true mean. It is true that it is infeasible to compute, as we would need to do a forward pass over the whole training set for each parameter update.

The low-pass filter is just an alternative way of seeing it. What I mean is that this running average produces a smooth estimate across different parameter updates, so it reduces peaks in the signal (as a low-pass filter does). It is the same with momentum and SGD.

Finally, what I mean by using the validation set is normalizing with the validation statistics only where BN takes place (the input would still be normalized using training statistics). I know it’s not the best way, but as long as I do not know how well this estimator approximates the true mean and true std, maybe using the mean and variance from my validation set to normalize each layer could be a quick approximation.

I don’t know what exact measure of goodness you are looking for. Yes, it is a smoothed moving average with momentum. If you want a theoretical guarantee, then unfortunately you can definitely construct examples where this approximation is bad. That being said, it usually proves to be sufficient in practice, so people don’t tend to worry about it.

They don’t even need to be close to the true mean and var, as long as they give good results. After all, it’s just a smart “trick” for training networks. If you are worried about this, I think you should be more worried about why BN works, what internal covariate shifts are, and what is causing these shifts.

I was just interested in knowing whether it is purely empirical or there is an exact reason for it.

There is definitely a reason for it. Suppose you have a deep network of L layers with BN at almost every linear layer, and a large dataset of N samples. To get accurate dataset statistics for each BN layer, you would need to activate the network O(NL) times, which is too much. Using running estimators to estimate the statistics of a dynamic (but converging) system is cheap and appears to give good results in the case of BN, so it is preferred in this scenario.
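
To put illustrative numbers on it (made up purely for scale): with N = 10^6 samples and L = 50 BN layers, exact layer-by-layer statistics would need on the order of N × L = 5 × 10^7 forward activations, whereas the running estimate is updated essentially for free during the forward passes training already performs.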

Yes, I totally agree. I only do a big forward pass over the whole dataset once the network is trained, to run the test. For validation I always use the running estimates.


Hi. For the last year and a half I have been using a faster, lower-cost normalisation that works perfectly with single-sample SGD, i.e. minibatch size = 1. This allows much larger deep nets to be trained. Please see this paper https://arxiv.org/abs/1706.03907 , Deep Control: a simple automatic gain control for memory efficient and high performance training of deep convolutional neural networks.

note from moderators: read this post with skepticism. trusted users have reported that this paper is the same as a known and obvious technique, but upsells it with unknown intent

I’m not sure that many people need an almost trivial instance norm variation that has dubious patent claims attached to it.
It just seems to invite trouble, and advertising it here has me scratching my head over whether it should be considered spam.


Hi Tom

Not in the slightest spam; the patent is there mostly to protect the date, and it is now public (i.e. license free).

The technique is in production in automotive research and explains which part of batch normalisation is actually useful (the mean subtraction and scaling), and that minibatches per se are not useful, though they are not excluded.

The option is there for single-sample SGD, which saves a lot of memory footprint, and more than a year of using the AGC technique suggests that single-sample is the most accurate and lets me get top results on the automotive datasets I use, beating all other techniques, so there must be something useful :)

If someone still needs this, we wrote up a small script to compute the population statistics:

decayed_average
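
A minimal sketch of what such a script could look like (the DecayedAverage name and interface here are illustrative, not the original code): it tracks both the decayed (exponential) average and the exact population mean over the same stream of batches, so the two can be compared.

```python
import torch

class DecayedAverage:
    """Track an exponentially decayed average and the exact population mean."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.decayed = None   # exponential moving average of batch means
        self.total = 0.0      # running sum for the exact population mean
        self.count = 0

    def update(self, batch):
        batch_mean = batch.float().mean()
        if self.decayed is None:
            self.decayed = batch_mean
        else:
            self.decayed = (1 - self.momentum) * self.decayed + self.momentum * batch_mean
        self.total += batch.float().sum()
        self.count += batch.numel()

    @property
    def population_mean(self):
        return self.total / self.count


# Usage example: both estimates end up near the true mean of 2.0.
stats = DecayedAverage(momentum=0.1)
for _ in range(200):
    stats.update(torch.randn(64) + 2.0)
print(stats.decayed.item(), stats.population_mean.item())
```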