Hi,
As per the Batch Normalization paper:
"A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1."
This is because of Bessel's correction, as pointed out by Adam.
A guess would be that BatchNorm uses Bessel's correction for the variance, and with a batch size of 1 this makes it NaN: the computed (biased) variance is 0, and the correction factor is n / (n - 1) = 1 / 0 = inf, so inf * 0 = NaN.
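You can see the effect with a quick check in PyTorch (just a minimal illustration of the variance computation with and without Bessel's correction, not BatchNorm's exact internals):

```python
import torch

x = torch.randn(1, 3)                    # a "batch" containing a single sample
biased = x.var(dim=0, unbiased=False)    # divides by n     -> zeros
unbiased = x.var(dim=0, unbiased=True)   # divides by n - 1 -> NaNs
print(biased)    # tensor([0., 0., 0.])
print(unbiased)  # tensor([nan, nan, nan])
```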
So, if you can afford to use a batch size > 1, that would solve the NaN problem for you.
If you are using a very small batch size or non-i.i.d. batches, maybe you could look at Batch Renormalization (https://arxiv.org/pdf/1702.03275.pdf).
Regards
Nabarun