Strange behavior of BatchNorm2d in evaluation mode (train vs eval)

During training I set all BatchNorm2d layers to track_running_stats=True, affine=True. I use batch size = 1. During evaluation the batch size is also 1, and what I’ve noticed is that if the batch normalization layers are kept with train=True (i.e. they don’t use the running means and stds to normalize the image), performance is much better. What am I doing wrong here?

Leaving batch norm training enabled and using batch size 1 makes it behave in a similar way to InstanceNorm w.r.t. training and inference (although what is averaged over differs). Depending on the task, that can be a reasonable choice. We usually don’t do that because it runs into problems when inputs are very “dull” (i.e. have very little standard deviation): they are blown up a lot, so the neural network considered as a function may not be terribly continuous (well, technically continuity is an “on/off” term, but things can get really “steep”).
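
A minimal sketch of that equivalence (tensor shapes are made up): with batch size 1 and the layer in train() mode, BatchNorm2d normalizes each channel over the single sample’s spatial positions, which is the same reduction InstanceNorm2d does:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)  # batch size 1

# BatchNorm2d in train mode computes stats over (N, H, W) per channel;
# with N = 1 that matches InstanceNorm2d's per-sample, per-channel stats.
bn = nn.BatchNorm2d(3, affine=False, track_running_stats=True).train()
inorm = nn.InstanceNorm2d(3, affine=False)

print(torch.allclose(bn(x), inorm(x), atol=1e-6))  # True
```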

Best regards

Thomas

Thanks, I guess I don’t fully understand how batchnorm works during training/evaluation in PyTorch. During training, batchnorm accumulates a running history of means/stds and normalizes the inputs using the current batch statistics. From the doc:

‘by default, during training this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation’

Does it mean that only one mean/std (‘compound’) is used during evaluation? I noticed PyTorch stores either 64 or 256 means/stds.

No, it is per channel, and the various *Norm modules provide all sorts of variations. I would always recommend experimenting with things to be certain they work as expected: set one channel’s mean to 1 and the others to 0, set all stds to 0, and see what happens.
Then look at what changes when setting one std to 2, etc.
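
Something along those lines (the concrete numbers are just an illustration) could look like this, poking at the running stats directly and checking the eval-mode output per channel:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3, affine=False).eval()

# One running mean/var entry per channel, so changing them affects channels independently.
with torch.no_grad():
    bn.running_mean.copy_(torch.tensor([1.0, 0.0, 0.0]))  # channel 0 gets mean 1
    bn.running_var.zero_()                                # "std = 0" for every channel

x = torch.ones(1, 3, 2, 2)
print(bn(x))
# channel 0: (1 - 1) / sqrt(0 + eps) = 0
# channels 1, 2: (1 - 0) / sqrt(0 + eps) ≈ 316, i.e. blown up, not NaN
```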

Best regards

Thomas

Won’t it return all NaNs if I set std=0?

No, because an epsilon is always added to the variance; it is small enough not to affect the result much, but it keeps the denominator away from zero (1e-5 or so).
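
As a rough sketch of the normalization being used here (eps is the only thing keeping the denominator nonzero when the variance is 0):

```python
import torch

eps = 1e-5  # BatchNorm2d's default eps
x = torch.ones(4)
running_mean = torch.zeros(4)
running_var = torch.zeros(4)  # the "std = 0" case from above

y = (x - running_mean) / torch.sqrt(running_var + eps)
print(y)  # large but finite values (~316), no NaNs
```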

I guess your problem is related to the running mean and running variance: since you set your batch size to 1, the running stats will be strongly affected by the later samples. You can try to reduce the momentum or switch to GroupNorm.
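
For example (the channel count and momentum value below are placeholders), both options look like this:

```python
import torch.nn as nn

# Smaller momentum -> running estimates change more slowly per sample
# (BatchNorm2d's default momentum is 0.1).
bn = nn.BatchNorm2d(64, momentum=0.01)

# GroupNorm normalizes over groups of channels within each sample,
# so it does not depend on batch statistics at all.
gn = nn.GroupNorm(num_groups=8, num_channels=64)
```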

Thanks. So for batch size > 1, are the mean/std computed per batch in that case? That is, for each channel, I take x (one feature) and subtract that channel’s mean, where the mean is taken over that channel across the whole batch?

Yes, the batch mean is used in the training phase and the running average in the evaluation phase.
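
A small sketch of that train-mode computation (shapes are arbitrary): per channel, the mean/variance are taken over the batch and spatial dimensions, then used to normalize every element of that channel:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 4, 4)  # batch of 8
bn = nn.BatchNorm2d(3, affine=False).train()

# Per-channel statistics over the batch and spatial dims (N, H, W).
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(bn(x), manual, atol=1e-5))  # True
```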