Running mean/variance: biased or unbiased, and running variance via the running mean or the current batch mean

Currently for batch norm:

  • The batch mean and the biased batch variance are used during training for normalization.
    • Why the biased batch variance and not the unbiased batch variance?
    • Why not the running mean/var? I would guess the running mean/var are better estimates, especially if the mini-batches are small?
  • The batch mean and the unbiased batch variance are used during training to update the running mean/var.
    • Why the unbiased variance here instead of the biased variance?
  • The batch variance is estimated using the batch mean.
    • Why use the batch mean and not the running mean? The running mean should be a better estimate of the mean than the batch mean?
      • If the running mean is used, is this always unbiased?
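For concreteness, the behavior described in the bullets above can be sketched as follows. This is a minimal NumPy sketch of one training step, not any framework's actual API; `batchnorm_train_step` and its signature are illustrative, and the momentum-style running update is an assumption:

```python
import numpy as np

def batchnorm_train_step(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Illustrative batch norm training step: normalize with biased batch
    statistics, update running statistics with the unbiased variance."""
    n = x.shape[0]
    batch_mean = x.mean(axis=0)
    biased_var = ((x - batch_mean) ** 2).mean(axis=0)  # divide by n
    unbiased_var = biased_var * n / (n - 1)            # divide by n - 1
    # Normalization uses the biased batch statistics.
    x_hat = (x - batch_mean) / np.sqrt(biased_var + eps)
    # Running statistics are updated with the unbiased variance.
    new_running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_running_var = (1 - momentum) * running_var + momentum * unbiased_var
    return x_hat, new_running_mean, new_running_var
```

The questions above are exactly about the two places where this sketch makes a choice: the biased divisor `n` for normalization versus the unbiased divisor `n - 1` for the running update.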

I found some related discussions, but I still don’t fully understand their conclusions, i.e. what the answers to my questions above would be:

Hey there,
I hope my response clarifies your doubts:

  • The biased batch variance is used during training for normalization because it’s faster to compute. However, using the running mean/variance, especially with small mini-batches, might yield better results.
  • The unbiased batch variance is used during training to update the running mean/var because it provides a more accurate estimate.
  • The batch mean is used to estimate the batch variance because it’s readily available. However, using the running mean could potentially yield better estimates of the mean.

Using the running mean doesn’t always guarantee unbiasedness; it depends on the specific implementation.


Thanks for the reply!

Faster than what? Faster than the running variance? But you compute the running variance anyway, so I don’t understand where you would save anything. Or faster than the unbiased batch variance? But I don’t understand how there can be any difference: in both cases, after taking the sum (var_sum = sum((x - mean)**2) with mean = sum(x)/n), you divide either by n (var_sum/n) or by (n-1) (var_sum/(n-1)). See code here.
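To make that point concrete: the biased and unbiased estimators share the exact same sum, and differ only in the final division. A small NumPy check (the data values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)
mean = x.sum() / n
var_sum = ((x - mean) ** 2).sum()
biased = var_sum / n          # divide by n
unbiased = var_sum / (n - 1)  # divide by n - 1
# Same summation either way; only the final divisor differs.
assert np.isclose(biased, np.var(x))            # NumPy's default ddof=0
assert np.isclose(unbiased, np.var(x, ddof=1))  # ddof=1 gives the n-1 divisor
assert np.isclose(unbiased, biased * n / (n - 1))
```

So the cost difference between the two is a single scalar division, which makes the "faster to compute" explanation hard to see.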

But then why is the unbiased batch variance not used during training for renormalization?

But you also have the running mean readily available?

I mean sum((x - running_mean)**2) / n. Is this biased or unbiased?
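A quick Monte Carlo sketch of that question, under the idealized assumption that the running mean is a fixed constant equal to the true mean (in practice the running mean is itself an estimate correlated with recent batches, so this is only an approximation): the n divisor is only biased when the mean is estimated from the same batch.

```python
import numpy as np

# Compare two variance estimators, both dividing by n:
#   - around the batch mean (the usual biased estimator)
#   - around an external, fixed mean equal to the true mean
rng = np.random.default_rng(0)
true_mean, true_var, n, trials = 3.0, 4.0, 8, 200_000

x = rng.normal(true_mean, np.sqrt(true_var), size=(trials, n))
est_batch = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # batch mean, /n
est_fixed = ((x - true_mean) ** 2).mean(axis=1)                      # fixed mean, /n

print(est_batch.mean())  # ≈ true_var * (n - 1) / n = 3.5 (biased low)
print(est_fixed.mean())  # ≈ true_var = 4.0 (unbiased, despite the n divisor)
```

This matches the earlier reply that "it depends on the implementation": if the external mean equals the true mean, dividing by n is already unbiased; if it differs from the true mean, the estimator is biased upward by the squared difference.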
