NaN when I use batch normalization (BatchNorm1d)

I made a module that uses the following MLP module:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.layers = []
        self.layersnorm = []
        self.activation = activation
        for i in range(len(size_layers) - 1):
            self.layers.append(nn.Linear(size_layers[i], size_layers[i + 1]))
            self.add_module('layers_' + str(i), self.layers[-1])

            self.layersnorm.append(nn.BatchNorm1d(size_layers[i + 1]))
            self.add_module('BatchNorm1d_' + str(i), self.layersnorm[-1])

    def forward(self, x):
        # all layers except the last: linear -> batch norm -> activation
        for i in range(len(self.layers) - 1):
            if self.activation == 'relu':
                x = F.relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'lrelu':
                x = F.leaky_relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'tanh':
                x = F.tanh(self.layersnorm[i](self.layers[i](x)))
        # last layer: linear -> batch norm, no activation
        x = self.layersnorm[-1](self.layers[-1](x))
        return x

    def l1reg(self):
        # L1 penalty over all linear-layer weights
        w = 0.
        for i in range(len(self.layers)):
            w = w + torch.sum((self.layers[i].weight).abs())
        return w

Everything works fine without batch normalization.
With batch normalization the training seems to work, but evaluation (using model.eval()) produces NaN.
Is there something I'm doing wrong with batch normalization?

thanks!


I can't see anything obviously wrong with the model. Are you sure the test data doesn't have NaNs in it?

Also, it might be easier for you to use nn.ModuleList instead of adding the modules and maintaining a separate list by hand (see the sketch below).
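Roughly, the constructor could look like this (a sketch based on your code, not tested):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.activation = activation
        # ModuleList registers its entries as submodules, so no add_module calls are needed
        self.layers = nn.ModuleList(
            [nn.Linear(size_layers[i], size_layers[i + 1]) for i in range(len(size_layers) - 1)])
        self.layersnorm = nn.ModuleList(
            [nn.BatchNorm1d(size_layers[i + 1]) for i in range(len(size_layers) - 1)])

forward and l1reg can stay as they are, since a ModuleList can still be indexed like a plain list.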

Could you provide a minimal script that reproduces the problem?

I can imagine having NaN during training mode if all the elements of the batch are zero, and so the mean and the std over the batch would be zero as well, leading to NaN.

Thanks for the reply. The MLP code works independently but not inside another module; I'll check what is wrong.
In any case, without "model.eval()" it works fine, which is strange.
Where do I find an example or documentation of nn.ModuleList?
Thank you!

The docs aren't there yet; I'll be writing them today in the afternoon. You can construct it by giving it a list of modules and that should work.

Did you find the reason why the loss becomes NaN in test mode?
I have a similar problem.

I also have that problem. I wanted to suggest increasing eps, which temporarily seemed to fix the issue, but it didn't. Is there any suggestion for how to debug this? My input is fine (no NaNs).

Hello all,

I ran into a similar problem: I am using BatchNorm1d with a batch size of 1, which always results in running_var values that are NaN. Specifically, this only occurs with a batch of size 1.

This problem doesn't occur with BatchNorm2d.

I thought it was possibly due to the eps value, as someone suggested above, but this wouldn't explain why the 2d case is fine and why it doesn't produce NaNs for the first stddev calculation.

EDIT:

  • I presume the NaN isn't a result of performing 1 / (0 + eps), where the 0 arises because the variance is computed from a single example?

For example:

import torch
import torch.nn as nn
from torch.autograd import Variable

input = torch.FloatTensor(1, 4).normal_(0, 1)  # a single sample with 4 features
bn = nn.BatchNorm1d(4)

output = bn(Variable(input))

print("output ...\n", output)
print("running mean ...\n", bn.running_mean)
print("running var ...\n", bn.running_var)

produces:

output ...
Variable containing:
 0  0  0  0
[torch.FloatTensor of size 1x4]

running mean ...

 0.0437
 0.0830
 0.0557
 0.1216
[torch.FloatTensor of size 4]

running var ...

nan
nan
nan
nan
[torch.FloatTensor of size 4]

Have I missed something obvious?

Cheers,
Jordan

A guess would be that BatchNorm uses Bessel's correction for the variance, and this makes it NaN: with a batch of size n = 1 the computed variance is 0, and the unbiased estimate n / (n - 1) * var = 1 / 0 * 0 = NaN.
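You can see the same effect with the unbiased variance directly (a minimal sketch, assuming your torch.var/Tensor.var accepts the unbiased flag):

import torch

x = torch.randn(1, 4)                 # one sample, as in the example above
print(x.var(dim=0, unbiased=True))    # nan for every feature: divides by n - 1 = 0
print(x.var(dim=0, unbiased=False))   # 0 for every feature: the biased estimate is fine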


Yup, just found it in the sources.

So if I'm understanding correctly, the solution is to use BatchNorm2d?

In the case of BatchNorm2d and batch size = 1, does it work for you even in eval() mode? I'm currently using BatchNorm2d with batch size = 1, but I have to stay in train() mode, otherwise the accuracy drops dramatically.


Hello, did you solve this problem? Should I use the BN1d layer?

Hi,

As per the batch normalization paper,

A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1

This is because of Bessel's correction, as pointed out by Adam:

A guess would be that BatchNorm uses Bessel's correction for the variance, and this makes it NaN: with a batch of size n = 1 the computed variance is 0, and the unbiased estimate n / (n - 1) * var = 1 / 0 * 0 = NaN.

So if you can afford to use batch size > 1, that would solve the NaN problem for you.
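A quick check along those lines (the same setup as Jordan's snippet above, just with two samples):

import torch
import torch.nn as nn
from torch.autograd import Variable

input = torch.FloatTensor(2, 4).normal_(0, 1)  # two samples instead of one
bn = nn.BatchNorm1d(4)
output = bn(Variable(input))
print(bn.running_var)  # finite now: the unbiased variance divides by n - 1 = 1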

If you are using a very small batch size or non-i.i.d. batches, maybe you could look at Batch Renormalization (https://arxiv.org/pdf/1702.03275.pdf).

Regards
Nabarun


Hey, my test data has 10000 samples in total and my batch size is 32, but the output of BatchNorm1d is NaN only in model.eval().

In that case there is some other problem, most probably with your data. Batch norm by itself will not give NaN for batch sizes greater than 1. Did you scale your data? If in training you were using floats in the range 0-1 and at test time it is ints in 0-65535, your network might blow up.
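A quick sanity check you can run on a test batch (test_loader is just a placeholder for whatever loader you use):

import torch

x, _ = next(iter(test_loader))                      # hypothetical loader name
x = x.float()
print(torch.isnan(x).any(), torch.isinf(x).any())   # any NaN/inf in the inputs?
print(x.min().item(), x.max().item())               # is the range comparable to training?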

I have solved the problem in my case.
My len(train_data) = 55937 and my batch size = 64 >> 1, so it looked like there should be no problem.
But I found that 55937 % 64 = 1, which means the last batch has size 1,
so running_var becomes NaN after 1 epoch.
Hope it helps you.
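If you hit the same situation, one easy fix (assuming a standard DataLoader setup) is to drop the final incomplete batch:

from torch.utils.data import DataLoader

# drop_last=True discards the trailing size-1 batch, so BatchNorm never sees n = 1
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, drop_last=True)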


But I want to ask here: is the n the .num_batches_tracked in the BatchNorm parameters?
And why do I still get NaN when my batch size is not 1? :sob:

pretrain_dict['featureExtract.12.num_batches_tracked']
Out[52]: tensor(8638, device='cuda:0')

A cause might be that your features include NaN or inf values.

It is indeed a data problem. Thank you for your reply.