Hi,
As per the Batch Normalization paper:
"A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1."
This is because of Bessel's correction, as pointed out by Adam.
A guess would be that BatchNorm uses Bessel's correction for the variance, and with a batch size of 1 this makes it NaN: the computed (biased) variance is 0, and the correction factor is n / (n - 1) = 1 / 0 = inf, so inf * 0 = NaN.
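You can see the effect with a quick check in PyTorch (just a minimal illustration of the variance computation with and without Bessel's correction, not BatchNorm's exact internals):

```python
import torch

x = torch.randn(1, 3)                    # a "batch" containing a single sample
biased = x.var(dim=0, unbiased=False)    # divides by n     -> zeros
unbiased = x.var(dim=0, unbiased=True)   # divides by n - 1 -> NaNs
print(biased)    # tensor([0., 0., 0.])
print(unbiased)  # tensor([nan, nan, nan])
```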
So, if you can afford to use a batch size > 1, that would solve the NaN problem for you.
If you are using a very small batch size or non-i.i.d. batches, maybe you could look at Batch Renormalization (https://arxiv.org/pdf/1702.03275.pdf).
Regards
Nabarun