I was wondering how accurate the running average and running std are that a lot of people use (including PyTorch's batch norm functions).

I understand that for each batch the running average (r_avg) of the mean is computed as:

r_avg = r_avg * 0.1 + 0.9 * batch_mean

where batch_mean is the actual mean of the batch.
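For concreteness, here is a minimal sketch of one such update step in plain Python. The helper name `update_running` and the sample batch are made up for illustration. One thing worth checking against the formula as written above: PyTorch's batch norm layers document the update as `running = (1 - momentum) * running + momentum * batch_stat` with a default `momentum` of 0.1, so there the old running value keeps weight 0.9 and the new batch gets weight 0.1.

```python
def update_running(running, batch_stat, momentum=0.1):
    """One exponential-moving-average step (hypothetical helper).

    The new batch statistic gets weight `momentum`; the old running
    value keeps weight `1 - momentum`.
    """
    return (1.0 - momentum) * running + momentum * batch_stat


batch = [2.0, 4.0, 6.0, 8.0]           # made-up batch values
batch_mean = sum(batch) / len(batch)   # 5.0

r_avg = 0.0                                 # running mean starts at zero
r_avg = update_running(r_avg, batch_mean)   # 0.9 * 0.0 + 0.1 * 5.0
```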

How well does this estimator approximate the true mean? The true mean would be computed over all the training data:

avg = sum(all_data_values) / len(all_data_values)

Note that if I stop the training, compute the mean of each training batch, mu_1, mu_2 (for batch 1, batch 2), and then take the mean of these means, the result is in general not the true mean (the two agree only when every batch has the same size). That is, if N is the number of batches:

(mu_1 + mu_2 + ... + mu_N) / N \neq avg

I know this because I once had to compute the mean of a huge dataset that could not be kept in memory.
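The out-of-memory case can be sketched like this: accumulate a sum and a count batch by batch instead of averaging the batch means. This is only an illustration with made-up data, and all variable names are hypothetical. It also shows that the plain mean of batch means drifts from the true mean when batch sizes differ (for example, a short final batch).

```python
# Batches of unequal size, e.g. a short final batch (made-up data).
batches = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [100.0]]

# True mean over all values (only possible here because the data is tiny).
all_values = [x for b in batches for x in b]
true_mean = sum(all_values) / len(all_values)

# Unweighted mean of the per-batch means.
batch_means = [sum(b) / len(b) for b in batches]
mean_of_means = sum(batch_means) / len(batch_means)

# Streaming version: accumulate sum and count, one batch at a time,
# so only a single batch has to be in memory.
total, count = 0.0, 0
for b in batches:
    total += sum(b)
    count += len(b)
streamed_mean = total / count
```

Here `streamed_mean` matches `true_mean`, while `mean_of_means` is pulled far away by the single-element batch.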

Viewing this estimator from a signal processing perspective:

r_avg = r_avg * 0.1 + 0.9 * batch_mean

It is nothing more than a low-pass filter of the mean; that is, assuming our mean is noisy, this kind of estimator gives a more stable mean (we can view momentum in SGD the same way: it basically reduces the oscillations in the gradients). This makes me think this is the actual reason. However, I do not see whether it is an appropriate estimator for checking that training is going well (when looking at hyperparameters). I could also use the mean of my validation set.
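To make the filter view concrete: the recursion can be unrolled into a weighted sum of past batch means with geometrically decaying weights, which is a first-order IIR low-pass filter. The sketch below checks that numerically. It assumes the convention where the newest batch gets weight `momentum` (0.1 here); the function names and the input signal are made up.

```python
def ema_recursive(xs, momentum=0.1, start=0.0):
    """Apply r = (1 - momentum) * r + momentum * x over a sequence."""
    r = start
    for x in xs:
        r = (1.0 - momentum) * r + momentum * x
    return r


def ema_unrolled(xs, momentum=0.1, start=0.0):
    """Same value written as an explicit geometrically weighted sum."""
    t = len(xs)
    decay = 1.0 - momentum
    # The most recent sample has weight `momentum`; each step further
    # into the past multiplies the weight by `decay`.
    weighted = sum(momentum * decay ** (t - 1 - k) * x for k, x in enumerate(xs))
    return decay ** t * start + weighted


# Alternating "noisy" batch means around a level of 1.0 (made-up signal).
noisy_means = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(200)]

smoothed = ema_recursive(noisy_means)
```

The input swings by 0.5 around 1.0 on every step, while the smoothed value stays within a few hundredths of 1.0, which is the oscillation damping described above.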

It's not the true mean. But calculating the true mean is infeasible for larger networks and datasets.

I guess you can say this, with the filter being geometric, but it doesn't really contribute to understanding BN. In the end, it is just a running average, used because it is expensive to compute the population average. You can compute the population average, and people have tried that. From experiments, the consensus seems to be that there is no significant gain.

Using anything from val to train is breaking the purpose of val.

First, I know it is not the true mean; that is why I ask how well that estimator approximates the true mean. It is true that it is infeasible to compute it, as we would need to do a forward pass over the whole training set for each parameter update.

The low-pass filter is just an alternative way of seeing it. What I mean is that this running average produces a smooth average estimate across different parameter updates, so we reduce peaks in the function (as a low-pass filter does). It is the same with momentum and SGD.

Finally, what I mean by using the validation set is to normalize using the validation statistics only where BN takes place (the input is still normalized using training statistics). I know it's not the best way, but as long as I do not know how well this estimator approximates the true mean and true std, using the mean and variance from my validation set to normalize each layer could be a quick approximation.

I don't know what exact measure of goodness you are looking for. Yes, it is a smoothed moving average with momentum. If you want a theoretical guarantee, then unfortunately you can definitely construct examples where this approximation is bad. That being said, it usually proves to be sufficient in practice, so people don't tend to worry about it.
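One way to construct such a bad case, as a hedged illustration: feed the batches in sorted order instead of shuffling them. The running average then mostly reflects the last few batches. The data, batch size, and momentum convention (weight 0.1 on the newest batch) are assumptions for this sketch, not anything taken from a real training run.

```python
# Made-up dataset: 1000 values, fed in sorted order in batches of 10.
data = [float(i) for i in range(1000)]
batch_size = 10
momentum = 0.1  # assumed convention: weight given to the newest batch

true_mean = sum(data) / len(data)  # 499.5

running = 0.0
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    batch_mean = sum(batch) / len(batch)
    running = (1.0 - momentum) * running + momentum * batch_mean

# `running` ends up close to the most recent batch means (around 900),
# nowhere near the dataset mean of 499.5.
error = abs(running - true_mean)
```

With shuffled batches from a stationary distribution the gap would be much smaller, which is roughly the situation people rely on in practice.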

They don't even need to be close to the true mean and var, as long as they give good results. After all, it's just a smart "trick" for training networks. If you are worried about this, I think you should be more worried about why BN works, what internal covariate shifts are, and what is causing these shifts.

There is definitely a reason for it. Suppose you have a deep network of L layers with BN at almost every linear layer, and a large dataset of N samples. To get accurate dataset statistics for each BN layer, you would need to activate the network O(NL) times, which is too much. Using running estimators to estimate the statistics of a dynamic (but converging) system is cheap and appears to give good results in the case of BN. So it is preferred in this scenario.

Yes, I totally agree. I only do a big forward pass over the whole dataset when the network has finished training, to run the test. For validation I always use the running estimates.
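A rough sketch of how such a final full-dataset pass could accumulate exact statistics without holding everything in memory: merge per-batch count, mean, and sum of squared deviations (the pairwise update usually attributed to Chan et al.). In a real network each batch would be the activations at a given BN layer; here plain lists of made-up numbers stand in, and `merge_stats` is a hypothetical helper.

```python
import statistics


def merge_stats(n_a, mean_a, m2_a, batch):
    """Merge running (count, mean, sum of squared deviations) with a batch."""
    n_b = len(batch)
    mean_b = sum(batch) / n_b
    m2_b = sum((x - mean_b) ** 2 for x in batch)

    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


batches = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]  # made-up data

n, mean, m2 = 0, 0.0, 0.0
for b in batches:
    n, mean, m2 = merge_stats(n, mean, m2, b)

pop_var = m2 / n         # population (biased) variance
pop_std = pop_var ** 0.5

# Reference values computed with everything in memory.
flat = [x for b in batches for x in b]
ref_mean = statistics.fmean(flat)
ref_var = statistics.pvariance(flat)
```

Unlike the running estimate, this gives the exact mean and variance for fixed parameters, at the cost of one extra pass over the data.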

Hi. For the last year and a half I have been using a faster and lower-cost normalisation that works perfectly with single-sample SGD, i.e. minibatchsize=1. This allows much larger deep nets to be trained. Please see this paper: https://arxiv.org/abs/1706.03907 , Deep Control a simple automatic gain control for memory efficient and high performance training of deep convolutional neural networks.

note from moderators: read this post with skepticism. trusted users have reported that this paper is the same as a known and obvious technique, but upsells it with unknown intent

I'm not sure that many people have the need for an almost trivial instance norm variation that has dubious patent claims attached to it.
It just seems to invite trouble and advertising it here has me scratching my head whether it should be considered spam.

Not spam in the slightest. The patent is there mostly to protect the date, and it is now public (i.e. license free).

The technique is in production in automotive research, and it explains which part of batch normalisation is actually useful (the mean subtraction and scaling), and that minibatches per se are not useful, though not excluded.

The option is there for single-sample SGD, which saves a lot on memory footprint. More than a year's use of the AGC technique suggests that single sample is the most accurate, and it allows me to get top results on the automotive datasets I use, beating all other techniques, so there must be something useful