I have recently been using BN to train autoencoders of different architectures over a large number of image patches (50K per image). There is indeed a gotcha whenever BN is used with this kind of dataset, as follows.
After training for a long time (70 epochs or more with 4K batches each), the validation loss suddenly increases significantly and never recovers, while the training loss remains stable. Decreasing the learning rate only postpones the phenomenon. At this point the trained model is not usable if model.eval() is called, as it is supposed to be; yet if the output is normalized back to the regular pixel range, the results look acceptable. After several trials, the likely cause is the default epsilon (1e-5), which may be too small for long-term stability. However, increasing the epsilon leads to a slightly higher validation loss. Alternatively, the running mean and variance that PyTorch computes under the hood may have something to do with it, since keeping BN in training mode also alleviates the issue at inference time.
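For what it's worth, here is a minimal sketch of the two workarounds described above: passing a larger eps to the BatchNorm constructor, and forcing BN layers back into training mode at inference so they normalize with batch statistics instead of the accumulated running mean/var. The toy model and shapes are made up for illustration; only the eps argument and the eval/train toggling are the actual points.

```python
import torch
import torch.nn as nn

# Toy stand-in for the autoencoder; any nn.Module containing BN layers works.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16, eps=1e-3),  # larger than the 1e-5 default, per the workaround above
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

def eval_keep_bn_train(model: nn.Module) -> nn.Module:
    """Put the model in eval mode but keep BatchNorm layers in training mode,
    so inference uses per-batch statistics rather than the running mean/var."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
    return model

# Inference with BN left in training mode.
x = torch.randn(8, 3, 32, 32)
with torch.no_grad():
    y = eval_keep_bn_train(model)(x)
```

Note that keeping BN in training mode at inference also keeps updating the running statistics unless track_running_stats is disabled, so it is a diagnostic workaround rather than a fix.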
In view of the popularity of BN, I am a bit surprised that everyone seems happy with it; is that really the case?