NaN when I use batch normalization (BatchNorm1d)

I made a module that uses the following MLP module:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.layers = []
        self.layersnorm = []
        self.activation = activation
        for i in range(len(size_layers) - 1):
            self.layers.append(nn.Linear(size_layers[i], size_layers[i + 1]))
            self.add_module('layers_' + str(i), self.layers[-1])

            self.layersnorm.append(nn.BatchNorm1d(size_layers[i + 1]))
            self.add_module('BatchNorm1d_' + str(i), self.layersnorm[-1])

    def forward(self, x):
        # all layers except the last: linear -> batch norm -> activation
        for i in range(len(self.layers) - 1):
            if self.activation == 'relu':
                x = F.relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'lrelu':
                x = F.leaky_relu(self.layersnorm[i](self.layers[i](x)))
            elif self.activation == 'tanh':
                x = F.tanh(self.layersnorm[i](self.layers[i](x)))
        # last layer: linear -> batch norm, no activation
        x = self.layersnorm[-1](self.layers[-1](x))
        return x

    def l1reg(self):
        # L1 penalty over all linear-layer weights
        w = 0.
        for i in range(len(self.layers)):
            w = w + torch.sum((self.layers[i].weight).abs())
        return w

Everything works fine without batch normalization.
With batch normalization the training seems to work, but evaluation (using model.eval()) produces NaN.
Is there something I'm doing wrong with batch normalization?

thanks!


I can't see anything obviously wrong with the model. Are you sure the test data doesn't have NaNs in it?

Also, it might be easier for you to use nn.ModuleList instead of adding the modules and maintaining a separate list by hand (see the sketch below).
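Roughly, the constructor could look like this (a sketch based on your code, not tested):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, size_layers, activation):
        super(MLP, self).__init__()
        self.activation = activation
        # ModuleList registers its entries as submodules, so no add_module calls are needed
        self.layers = nn.ModuleList(
            [nn.Linear(size_layers[i], size_layers[i + 1]) for i in range(len(size_layers) - 1)])
        self.layersnorm = nn.ModuleList(
            [nn.BatchNorm1d(size_layers[i + 1]) for i in range(len(size_layers) - 1)])

forward and l1reg can stay as they are, since a ModuleList can still be indexed like a plain list.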

Could you provide a minimal script that reproduces the problem?

I can imagine having NaN during training mode if all the elements of the batch are zero, and so the mean and the std over the batch would be zero as well, leading to NaN.

Thanks for the reply. The MLP code works independently but not inside another module; I'll check what is wrong.
In any case, without "model.eval()" it works fine, which is strange.
Where do I find an example or documentation of nn.ModuleList?
Thank you!

The docs aren't there yet; I'll be writing them today in the afternoon. You can construct it by giving it a list of modules and that should work.

Did you find the reason why the loss becomes NaN in test mode?
I have a similar problem.

I also have that problem. I wanted to suggest increasing eps, which temporarily seemed to fix the issue, but it didn't. Is there any suggestion for how to debug this? My input is fine (no NaNs).

Hello all,

I ran into a similar problem: I am using BatchNorm1d with a batch size of 1, which always results in running_var values that are NaN. Specifically, this only occurs with a batch of size 1.

This problem doesn't occur with BatchNorm2d.

I thought it was possibly due to the eps value, as someone suggested above, but this wouldn't explain why the 2d case is fine and why it doesn't produce NaNs for the first stddev calculation.

EDIT:

  • I presume the NaN isn't a result of performing 1 / (0 + eps), where the 0 arises because the variance is computed from a single example?

For example:

import torch
import torch.nn as nn
from torch.autograd import Variable

input = torch.FloatTensor(1, 4).normal_(0, 1)  # a single sample with 4 features
bn = nn.BatchNorm1d(4)

output = bn(Variable(input))

print("output ...\n", output)
print("running mean ...\n", bn.running_mean)
print("running var ...\n", bn.running_var)

produces:

output ...
Variable containing:
 0  0  0  0
[torch.FloatTensor of size 1x4]

running mean ...

 0.0437
 0.0830
 0.0557
 0.1216
[torch.FloatTensor of size 4]

running var ...

nan
nan
nan
nan
[torch.FloatTensor of size 4]

Have I missed something obvious?

Cheers,
Jordan

A guess would be that BatchNorm uses Bessel's correction for the variance, and this makes it NaN: with a batch of size n = 1 the computed variance is 0, and the unbiased estimate n / (n - 1) * var = 1 / 0 * 0 = NaN.
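You can see the same effect with the unbiased variance directly (a minimal sketch, assuming your torch.var/Tensor.var accepts the unbiased flag):

import torch

x = torch.randn(1, 4)                 # one sample, as in the example above
print(x.var(dim=0, unbiased=True))    # nan for every feature: divides by n - 1 = 0
print(x.var(dim=0, unbiased=False))   # 0 for every feature: the biased estimate is fine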


Yup, just found it in the sources.

So if I'm understanding correctly, the solution is to use BatchNorm2d?

In the case of BatchNorm2d and batch size = 1, does it work for you even in eval() mode? I'm currently using BatchNorm2d with batch size = 1, but I have to stay in train() mode, otherwise the accuracy drops dramatically.


Hello, did you solve this problem? Should I use the BN1d layer?

Hi,

As per the batch normalization paper,

A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent with a mini-batch size m > 1

This is because of Bessel's correction, as pointed out by Adam:

A guess would be that BatchNorm uses Bessel's correction for the variance, and this makes it NaN: with a batch of size n = 1 the computed variance is 0, and the unbiased estimate n / (n - 1) * var = 1 / 0 * 0 = NaN.

So if you can afford to use batch size > 1, that would solve the NaN problem for you.
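A quick check along those lines (the same setup as Jordan's snippet above, just with two samples):

import torch
import torch.nn as nn
from torch.autograd import Variable

input = torch.FloatTensor(2, 4).normal_(0, 1)  # two samples instead of one
bn = nn.BatchNorm1d(4)
output = bn(Variable(input))
print(bn.running_var)  # finite now: the unbiased variance divides by n - 1 = 1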

If you are using a very small batch size or non-i.i.d. batches, maybe you could look at Batch Renormalization (https://arxiv.org/pdf/1702.03275.pdf).

Regards
Nabarun


Hey, my test data has 10000 samples in total and my batch size is 32, but the output of BatchNorm1d is NaN only in model.eval().

In that case there is some other problem, most probably with your data. Batch norm by itself will not give NaN for batch sizes greater than 1. Did you scale your data? If in training you were using floats in the range 0-1 and at test time it is ints in 0-65535, your network might blow up.
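A quick sanity check you can run on a test batch (test_loader is just a placeholder for whatever loader you use):

import torch

x, _ = next(iter(test_loader))                      # hypothetical loader name
x = x.float()
print(torch.isnan(x).any(), torch.isinf(x).any())   # any NaN/inf in the inputs?
print(x.min().item(), x.max().item())               # is the range comparable to training?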

I have solved the problem in my case.
My len(train_data) = 55937 and my batch size = 64 >> 1, so it looked like there should be no problem.
But I found that 55937 % 64 = 1, which means the last batch has size 1,
so running_var becomes NaN after 1 epoch.
Hope it helps you.
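If you hit the same situation, one easy fix (assuming a standard DataLoader setup) is to drop the final incomplete batch:

from torch.utils.data import DataLoader

# drop_last=True discards the trailing size-1 batch, so BatchNorm never sees n = 1
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, drop_last=True)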


But I want to ask here: is the n the .num_batches_tracked in the BatchNorm parameters?
And why do I still get NaN when my batch size is not 1? :sob:

pretrain_dict['featureExtract.12.num_batches_tracked']
Out[52]: tensor(8638, device='cuda:0')

A cause might be that your features include NaN or inf values.

It is indeed a data problem. Thank you for your reply.