NaN when I use batch normalization (BatchNorm1d)

I made a module that uses the following MLP module:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.activation = activation
        self.layers = []
        self.layersnorm = []
        for i in range(len(size_layers) - 1):
            self.layers.append(nn.Linear(size_layers[i], size_layers[i + 1]))
            self.add_module('layers_' + str(i), self.layers[-1])

            self.layersnorm.append(nn.BatchNorm1d(size_layers[i + 1]))
            self.add_module('BatchNorm1d_' + str(i), self.layersnorm[-1])

    def forward(self, x):
        for i in range(len(self.layers) - 1):
            if self.activation == 'relu':
                x = F.relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'lrelu':
                x = F.leaky_relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'tanh':
                x = F.tanh(self.layersnorm[i](self.layers[i](x)))
        # last layer: batch norm but no activation
        x = self.layersnorm[-1](self.layers[-1](x))
        return x

    def l1reg(self):
        # L1 penalty summed over all linear-layer weights
        w = 0
        for i in range(len(self.layers)):
            w = w + torch.sum(self.layers[i].weight.abs())
        return w

Everything works fine without batch normalization.
With batch normalization, training seems to work, but evaluation (using model.eval()) produces NaN.
Is there something I'm doing wrong with batch normalization?



I can’t see anything obviously wrong with the model, are you sure the test data doesn’t have NaNs inside?

Also, it might be easier for you to use nn.ModuleList instead of adding the modules and maintaining a list separately.
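
For example, here is a minimal sketch of the same constructor rewritten with nn.ModuleList (only __init__ changes; forward can stay as it is):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.activation = activation
        # nn.ModuleList registers every submodule automatically,
        # so the add_module bookkeeping can be dropped entirely.
        self.layers = nn.ModuleList(
            [nn.Linear(size_layers[i], size_layers[i + 1])
             for i in range(len(size_layers) - 1)])
        self.layersnorm = nn.ModuleList(
            [nn.BatchNorm1d(size_layers[i + 1])
             for i in range(len(size_layers) - 1)])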

Could you provide a minimal script that reproduces the problem?

I can imagine getting NaN in training mode if all the elements of the batch are zero, since the mean and the std over the batch would then be zero as well, leading to NaN.

Thanks for the reply. The MLP code works independently but not inside another module; I'll check what's wrong.
In any case, without model.eval() it works fine, which is strange.
Where can I find an example or documentation of nn.ModuleList?
Thank you!

The docs aren't there yet; I'll be writing them this afternoon. You can construct it by giving it a list of modules, and that should work.

Did you find the reason why the loss becomes NaN in test mode?
I have a similar problem.

I also have this problem. I wanted to suggest increasing eps, which temporarily seemed to fix the issue, but it didn't. Is there any suggestion on how to debug this? My input is fine (no NaNs).

Hello all,

I ran into a similar problem: I am using BatchNorm1d with a batch size of 1, which always results in a running_var full of NaNs. Specifically, this only occurs when the batch has size 1.

This problem doesn’t occur with BatchNorm2d.

I thought it was possibly due to the eps value, as someone suggested above, but this wouldn't explain why the 2d case is fine or why the first stddev calculation doesn't produce NaNs.


  • I presume the NaN isn't the result of computing 1 / (0 + eps), where the 0 arises because the variance is computed from a single example?

For example:

import torch
import torch.nn as nn
from torch.autograd import Variable

input = torch.FloatTensor(1, 4).normal_(0, 1)
bn = nn.BatchNorm1d(4)

output = bn(Variable(input))

print("output ...\n", output)
print("running mean ...\n", bn.running_mean)
print("running var ...\n", bn.running_var)


output …
Variable containing:
0 0 0 0
[torch.FloatTensor of size 1x4]

running mean …

[torch.FloatTensor of size 4]

running var …
nan nan nan nan
[torch.FloatTensor of size 4]

Have I missed something obvious?


A guess would be that BatchNorm uses Bessel's correction for the variance, and this makes it NaN: the computed variance is 0, and with n = 1, n / (n - 1) * var = 1 / 0 * 0 = NaN.
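
This is easy to check directly with torch.var, which applies the same correction when unbiased=True (a minimal check, not the actual BatchNorm code path):

import torch

x = torch.ones(1)                    # a "batch" containing a single value
print(torch.var(x, unbiased=False))  # tensor(0.): biased variance of one element
print(torch.var(x, unbiased=True))   # tensor(nan): Bessel's correction divides by n - 1 = 0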


Yup, just found it in the sources.

So if I'm understanding correctly, the solution is to use BatchNorm2d?

In the case of BatchNorm2d and batch size = 1, does it work for you even in eval() mode? I'm currently using BatchNorm2d with batch size = 1, but I have to stay in train() mode, otherwise the accuracy drops dramatically.
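
(For context, a small sketch of why the 2d case escapes the n = 1 problem: the per-channel variance is taken over the spatial positions as well, so n > 1 even for a single sample.)

import torch
import torch.nn as nn
from torch.autograd import Variable

# With a 1 x C x H x W input, each channel's statistics are computed over
# H * W = 64 values, so the Bessel denominator n - 1 is nonzero even though
# the batch itself has a single sample.
bn2 = nn.BatchNorm2d(4)
out = bn2(Variable(torch.randn(1, 4, 8, 8)))
print(bn2.running_var)  # finite values, no NaN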


Hello, have you solved this problem? Should I use the BN1d layer?


As per the batch normalization paper,

A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1

This is because of Bessel's correction, as pointed out by Adam:

A guess would be that BatchNorm uses Bessel's correction for variance and this makes it NaN (computed variance is 0, n / (n - 1) * var = 1 / 0 * 0 = NaN).

So if you can afford to use a batch size > 1, that will solve the NaN problem for you.

If you are using a very small batch size or non-i.i.d. batches, maybe you could look at Batch Renormalization.
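
To make the m > 1 point concrete, a small sketch (behaviour as on the PyTorch version discussed in this thread; newer releases refuse a training-mode batch of 1 with an error instead):

import torch
import torch.nn as nn
from torch.autograd import Variable

bn = nn.BatchNorm1d(4)

bn(Variable(torch.randn(2, 4)))  # batch of 2: n / (n - 1) = 2, variance stays finite
print(bn.running_var)            # finite values

bn(Variable(torch.randn(1, 4)))  # batch of 1: 1 / (1 - 1) makes the batch variance NaN
print(bn.running_var)            # running_var is now contaminated with NaN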



Hey, my test set has 10000 samples in total and my batch size is 32, but still, only in model.eval(), the output of BatchNorm1d is NaN.

In that case there is some other problem, most probably with your data. BatchNorm by itself will not give NaN for batch sizes greater than 1. Did you scale your data? If in training you were using floats in the range 0-1 and at test time it's ints in 0-65535, your network might blow up.
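
For instance (a hypothetical preprocessing mismatch; the shapes and ranges are made up for illustration):

import torch

# Hypothetical raw test batch stored as integers in 0-65535.
x_test = torch.randint(0, 65536, (32, 784))

# Scale it to the same float 0-1 range the network saw during training.
x_test = x_test.float() / 65535.0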

I have solved the problem in my case.
My len(train_data) = 55937 and my batch size = 64 >> 1, so it looked like there should be no problem.
But I found that 55937 % 64 = 1, which means the last batch has size 1,
so running_var becomes NaN after one epoch.
Hope it helps you.
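
If that's the cause, one simple fix is the DataLoader's drop_last flag (a sketch, assuming train_data is the dataset from above):

from torch.utils.data import DataLoader

# drop_last=True discards the final incomplete batch (here the single
# leftover sample, since 55937 % 64 = 1), so BatchNorm never sees a
# training batch of size 1.
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, drop_last=True)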


But I want to ask here: is the n the .num_batches_tracked in the BatchNorm parameters?
And why do I still get NaN even though my batch count is not 1? :sob:

Out[52]: tensor(8638, device='cuda:0')