Problem with NaN values in the model parameters: weight and bias

Hello,

I'm new to deep learning.
Once my batch is generated and I start training my model, I always get NaN values in
output = model(input_var)
When I debug, I also find NaN values in the model parameters: weight and bias.
These NaN values appear after n iterations.
Do you have any idea where the error could be?

Many thanks for your reply

![Screenshot from 2018-03-15 11-04-08|690x454](upload://b9EYe4Tk1tGJDarCjYR3AaWocnJ.png)

Hello,
your screenshot/link is not displaying properly, which is kind of problematic if you want some help :blush:
A small, formatted snippet of your code might also help people give you some advice
(if I'm not mistaken, surrounding code with ``` formats it so it is easier to read).

def train_epoch_cpu(trainLoader, model, optimizer, criterion, epoch):
    Losses = []

    model.train()
    for i, (ids, tensor) in enumerate(trainLoader):
        input_var = torch.autograd.Variable(tensor)
        # compute output
        output = model(input_var)
        loss = criterion(output)

        # loss = angular_distance(output.data.cpu(), targets.cpu())
        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        Losses.append(loss.data.cpu()[0])
        optimizer.step()

        # print('Train Epoch: [{0}][{1}/{2}]\t'
        #       'angles {ang} ({ang})\t'
        #       'Loss {loss} ({loss})\t'.format(
        #       epoch, i, len(trainLoader), ang=loss, loss=Losses))

    return loss

Here, in output = model(input_var), the returned output is NaN.
When I debug the code, I find the model parameters are NaN as well.
The initially declared net is:
NetDSpace (
(conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
(pool): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear (144 -> 2048)
(fc2): Linear (2048 -> 1024)
(fc3): Linear (1024 -> 400)
)
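
(For anyone debugging the same symptom, here is a minimal sketch of how the moment the parameters break could be narrowed down. It assumes a reasonably recent PyTorch where torch.isnan is available; find_nan_parameters is just an illustrative helper name, not part of the code above.)

import torch

def find_nan_parameters(model):
    # Return the names of all parameter tensors (weights and biases)
    # that already contain NaN values.
    bad = []
    for name, param in model.named_parameters():
        if torch.isnan(param.data).any():
            bad.append(name)
    return bad

Calling this after every optimizer.step() and printing the result shows at which iteration the weights and biases first turn into NaN.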

Is your tensor also returning a NaN (when you make it into a Variable)?

Can you please explain which step or function you mean?
Many thanks.

Sure, I was thinking about this line; maybe there is something that goes wrong in your tensor?

How can I detect that something is wrong?
Normally the input size is [torch.FloatTensor of size 64x1x20x20], where
64 is the batch size and
20x20 is the size of my input image.
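
(A quick way to check whether something is already wrong in the batch itself. This is only a sketch reusing the loop variables from the training code above, and it assumes a PyTorch version that provides torch.isnan and torch.isinf.)

import torch

for i, (ids, tensor) in enumerate(trainLoader):
    # Flag batches that contain NaN or inf before they ever reach the model.
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        print('Bad values in batch', i, 'ids:', ids)
    # Unnormalised grayscale images often live in [0, 255]; a large value
    # range is a hint that normalisation is missing.
    print('batch', i,
          'min', float(tensor.min()),
          'max', float(tensor.max()),
          'mean', float(tensor.mean()))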

Is there any error message reported?

There is no error message, but the problem shows up in the printed loss values
(I have modified the size of the net):
NetDSpace (
(conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
(pool): MaxPool2d (size=(2, 2), stride=(2, 2), dilation=(1, 1))
(conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear (144 -> 120)
(fc2): Linear (120 -> 84)
(fc3): Linear (84 -> 10)
)


Average IOU during train Reg0 , epoch 0 is Variable containing:
0.2594
[torch.FloatTensor of size 1]


Average IOU during train Reg0 , epoch 1 is Variable containing:
0.3301
[torch.FloatTensor of size 1]


Average IOU during train Reg0 , epoch 2 is Variable containing:
nan
[torch.FloatTensor of size 1]


Average IOU during train Reg0 , epoch 3 is Variable containing:
nan
[torch.FloatTensor of size 1]


Average IOU during train Reg0 , epoch 4 is Variable containing:
nan
[torch.FloatTensor of size 1]

My guess is that the inputs are not properly normalised.

How can I ensure the normalisation?
My batch is built from the filename list of the input training grayscale images.

Do you have any suggestion on how to normalise it properly?

Although the proper way is to find the mean and variance for your whole training set and use that to normalise your images (scikit-learn has some classes for this), there is a quicker way to check whether normalisation helps.

Just add a batch normalisation layer (torch.nn.BatchNorm2d) as the very first layer of your network and it will normalise on a per-batch basis (64 images in your case). Not perfect, but it should give you enough intuition about whether this will solve your problem.

And when that indeed helps, you can normalise over your whole training set.
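
(As a rough sketch of that suggestion, the per-batch normalisation would sit directly in front of conv1. The layer name input_bn is chosen here just for illustration; the remaining layers mirror the net printed above.)

import torch.nn as nn
import torch.nn.functional as F

class NetDSpace(nn.Module):
    def __init__(self):
        super(NetDSpace, self).__init__()
        # Normalise every incoming batch of 1-channel images before conv1.
        self.input_bn = nn.BatchNorm2d(1)
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.fc1 = nn.Linear(16 * 3 * 3, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.input_bn(x)
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 3 * 3)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x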

Hi Peter,
many thanks for your recommendation.
My code now looks like this:

import torch.nn as nn
import torch.nn.functional as F

class NetDSpace(nn.Module):
    def __init__(self):
        super(NetDSpace, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 3)
        self.conv2_bn = nn.BatchNorm2d(16)
        self.fc1 = nn.Linear(16 * 3 * 3, 120)
        self.fc1_bn = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 3 * 3)
        x = F.relu(self.fc1_bn(self.fc1(x)))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

So is it correct?
I have tested it, and I no longer get NaN values, but the loss values vary a lot: sometimes they decrease and sometimes they increase (not stable).
One question, please: what did you mean by 'And when that indeed helps, you can normalise over your whole training set'?

I actually meant to say to add the BatchNorm only as the first layer of your network (so before conv1) in order to "simulate" normalisation. But if this works and avoids the NaN, then indeed your problem (or part of it) seems to be normalisation, or more correctly the lack of it.

The downside of BatchNorm is that the normalisation only happens per batch, so 64 images in your case. You normally achieve better results if you do the normalisation based on your complete training set and not just 64 images at a time.

So you calculate the mean and variance over all your training images and then normalise each image using this mean and variance. This would be done in the preprocessing phase (and not be part of your network, as is the case with BatchNorm). I guess some searching on image normalisation should give you some reusable code snippets.
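
(A minimal sketch of that preprocessing step, assuming the trainLoader from earlier yields (ids, tensor) batches of shape [B, 1, 20, 20]; the two-pass approach and variable names here are only illustrative.)

from torchvision import transforms

# Pass 1: accumulate mean and variance over the whole training set.
total, total_sq, count = 0.0, 0.0, 0
for ids, tensor in trainLoader:
    total += float(tensor.sum())
    total_sq += float((tensor ** 2).sum())
    count += tensor.numel()
mean = total / count
std = (total_sq / count - mean ** 2) ** 0.5

# Pass 2: normalise every image with these fixed, dataset-wide statistics
# as part of the preprocessing pipeline (instead of a BatchNorm layer).
normalize = transforms.Normalize(mean=[mean], std=[std])

The resulting normalize transform would then be applied to every image tensor before it is batched.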

I wonder if the normalisation solved your problem? I am interested because I have a similar problem which I cannot solve.

The normalisation didn't resolve the problem on its own, but it can help; the main problem in my case was related to the loss function.

I do not get the point: how can image normalisation, in theory, help to avoid NaNs? Although I understand why normalisation is helpful (in terms of stability), even at the first layer of a network, your point seems very strange to me.

Unnormalized inputs could theoretically create large intermediate activations. If the training diverges due to these high values, you might encounter an overflow and could run into NaNs.
This happened a few times in other posts and you would usually see a very high loss in some iterations before it blows up.
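
(One way to catch that kind of divergence early, sketched against the training loop from the beginning of the thread; the max_norm value and the clip_grad_norm_ call are optional additions, not something from the original code, and loss.item() would replace float(loss.data) on newer PyTorch versions.)

import math
import torch

output = model(input_var)
loss = criterion(output)

loss_value = float(loss.data)
if math.isnan(loss_value) or math.isinf(loss_value):
    # Stop and inspect the current batch before the weights get corrupted.
    print('Loss became non-finite at iteration', i)
else:
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients so a single extreme batch cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()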
