Batch normalization, batch size, and data loader's last batch

Hi @dtolpin,

thank you for sharing this interesting problem and the detailed analysis.
To me, your first option (making all batches the same size) sounds like the more reasonable one in practice. Quite likely, you could just pick random samples to duplicate for this and be done with it.
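If you go that way, here is a minimal sketch of how one might pad out the last minibatch with duplicated samples via a custom sampler (PaddedRandomSampler is a hypothetical helper I am making up here, not something that ships with torch.utils.data):

```python
import random

import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class PaddedRandomSampler(Sampler):
    """Shuffle the dataset and duplicate random indices so that the number
    of drawn samples is an exact multiple of batch_size."""

    def __init__(self, data_source, batch_size):
        self.data_source = data_source
        self.batch_size = batch_size

    def __iter__(self):
        indices = list(range(len(self.data_source)))
        random.shuffle(indices)
        remainder = len(indices) % self.batch_size
        if remainder:
            # fill up the last minibatch with randomly chosen duplicates
            indices += random.choices(indices, k=self.batch_size - remainder)
        return iter(indices)

    def __len__(self):
        n = len(self.data_source)
        return ((n + self.batch_size - 1) // self.batch_size) * self.batch_size

# every batch, including the last one, now has exactly 16 samples
dataset = TensorDataset(torch.randn(103, 8))
loader = DataLoader(dataset, batch_size=16,
                    sampler=PaddedRandomSampler(dataset, batch_size=16))
```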

I must admit that I am quite unsure whether I interpret pytorch’s momentum parameter correctly, but if it means something like alpha in

running_mean_estimate = alpha * running_mean_estimate + (1-alpha) * minibatch_mean,

I would expect something more like 0.9 rather than pytorch’s default of 0.1. So changing the momentum might help, too, in particular if your analysis for option 2 (use minibatch size in running average computation) is correct.
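For what it is worth, one can check the convention empirically by doing a single training step on a BatchNorm1d layer and comparing the new running mean against the formula by hand. As far as I can tell from the docs, PyTorch updates the running statistics as running = (1 - momentum) * running + momentum * batch_stat, i.e. its momentum plays the role of (1 - alpha) above:

```python
import torch
import torch.nn as nn

# one training step on a BatchNorm1d layer and a check of the update rule
torch.manual_seed(0)
bn = nn.BatchNorm1d(4, momentum=0.1)
x = torch.randn(32, 4)

old_running_mean = bn.running_mean.clone()
bn.train()
bn(x)  # updates the running statistics

# running_mean = (1 - momentum) * old_running_mean + momentum * batch_mean
expected = 0.9 * old_running_mean + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # prints True
```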

If you wanted to go down the route of option 2: the other shortcoming of batch normalization as described in Algorithm 1 of Ioffe and Szegedy’s original article (and I would almost expect it to be the more significant one) is that during training, the mean and std are taken from the current minibatch alone. For very small minibatches, I would expect that to be disadvantageous, and I would expect a regularization like

regularized_mean_estimate = (actual_batchsize * minibatch_mean + (target_batchsize - actual_batchsize) * running_mean_estimate) / target_batchsize

regularized_variance_estimate = ((actual_batchsize - 1) * minibatch_variance + (target_batchsize - actual_batchsize) * running_variance_estimate) / (target_batchsize - 1)

to work much better. (You could have a fancy Bayesian thing to average them, too, and find out why and how my weights above are rubbish, but it might be a starting point.)
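Here is a rough sketch of what that could look like as a module, just spelling out the two formulas above. RegularizedBatchNorm1d is hypothetical, not a torch.nn layer, and the running-stat update is kept deliberately simple:

```python
import torch
import torch.nn as nn

class RegularizedBatchNorm1d(nn.Module):
    """During training, normalize with a blend of minibatch and running
    statistics, weighted by actual vs. target batch size (option 2);
    at evaluation time, use the running statistics as usual."""

    def __init__(self, num_features, target_batchsize, momentum=0.1, eps=1e-5):
        super().__init__()
        self.target_batchsize = target_batchsize
        self.momentum = momentum
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):  # x: (batch, num_features), batch <= target_batchsize
        if self.training:
            n, N = x.size(0), self.target_batchsize
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # the regularized estimates from the formulas above; for a full
            # batch (n == N) they reduce to the plain minibatch statistics
            mean = (n * batch_mean + (N - n) * self.running_mean) / N
            var = ((n - 1) * batch_var + (N - n) * self.running_var) / (N - 1)
            with torch.no_grad():  # simple exponential running-average update
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * batch_mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * batch_var)
        else:
            mean, var = self.running_mean, self.running_var
        return self.weight * (x - mean) / torch.sqrt(var + self.eps) + self.bias
```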

As I said above, in practice I would probably go with amending the data to fill up the last minibatch. On the other hand, it might be fun to see which works best: your suggested running mean/std estimate updates, the blanket momentum adjustment, or the regularization in the training batch normalization.

Best regards

Thomas