Output of model is NaN every time

Hi all.
I’m new to PyTorch. I’m trying to build my own classifier. I have a dataset of nearly 30,000 images across 52 classes, and each image is 60 × 80 pixels.
This is my network (I’m not sure about the number of neurons in each layer).

import torch
import torch.nn as nn
import torch.nn.functional as F

class my_network(nn.Module):
    
    def __init__(self, class_num, act=F.relu):
        
        super(my_network, self).__init__()
        
        self.layer1 = nn.Linear(1 * 60 * 80, 50 * 30 * 40)
        self.act1 = act 
        
        self.layer2 = nn.Linear(50 * 30 * 40, 70 * 10 * 15)
        self.act2 = act 
        
        self.layer3 = nn.Linear(70 * 10 * 15, 90 * 5 * 8)
        self.act3 = act
        
        self.layer4 = nn.Linear(90 * 5 * 8, 80)
        self.act4 = act
        
        self.layer5 = nn.Linear(80, class_num)
        
    def forward(self, x):

        x = x.view(x.size(0), -1)  # flatten each 1x60x80 image to a 4800-dim vector

        x = self.layer1(x)
        x = self.act1(x)

        x = self.layer2(x)
        x = self.act2(x)

        x = self.layer3(x)
        x = self.act3(x)

        x = self.layer4(x)
        x = self.act4(x)

        x = self.layer5(x)
        return x
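
For scale: these layer sizes add up to an enormous fully connected model. A quick way to count the parameters (just a sanity-check snippet; class_num=52 matches the dataset above):

# Rough sanity check of the model size (layer1 alone has 4800 * 60000 weights)
model = my_network(class_num=52)
num_params = sum(p.numel() for p in model.parameters())
print(num_params)  # just under a billion parameters, ~3.8 GB of float32 weights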

I’m using CUDA for my model, CrossEntropyLoss as my criterion, and SGD as my optimizer.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = my_network(len(classes))
model = model.to(device)

learning_rate = 0.01
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

I use the following code for training my model.

for epoch in range(num_epochs):
    train_loss = 0.

    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        outputs = model(images)
        print(outputs)
        loss = criterion(outputs, labels)

        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    average_loss = train_loss / len(train_loader)
    print(f"epoch {epoch}: average loss = {average_loss:.4f}")

And when I run this, the output is all NaN:

tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',

Also, I don’t want to normalize my data; I want to use it as it is.
My inputs do not contain any NaN or Inf values.
What am I doing wrong?

Hi,

Do you see how your loss changes before it becomes NaN?

If your inputs are not normalized, you might need to reduce the learning rate further to keep the training stable.
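
For example, something like this (just a sketch; the lr=1e-4 below is an arbitrary smaller value to try, not a tuned number):

# Sketch: train with a much smaller learning rate and log the loss per batch,
# stopping as soon as it becomes NaN, to see whether it blows up gradually.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for i, (images, labels) in enumerate(train_loader):
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    print(f"batch {i}: loss = {loss.item():.4f}")
    if torch.isnan(loss):
        break  # loss has already diverged; no point continuing
    loss.backward()
    optimizer.step()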


That seems odd. I can’t find anything in your case that would cause your error (except perhaps something weird with the data). Have you tried to overfit on a single example before training on your entire dataset?
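
Something along these lines (a rough sketch; the 200 steps are arbitrary):

# Sketch: repeatedly fit one fixed batch. If the loss doesn't go to ~0
# (or turns into NaN), the problem is in the model/optimizer setup,
# not in the rest of the dataset.
images, labels = next(iter(train_loader))
images, labels = images.to(device), labels.to(device)

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    if step % 20 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")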

Also, I think you could make your network a little cleaner (I’m not sure why you pass act in):

import torch.nn as nn
import torch.nn.functional as F

class my_network(nn.Module):
    def __init__(self, num_classes):
        super(my_network, self).__init__()

        self.layer1 = nn.Linear(1 * 60 * 80, 50 * 30 * 40)
        self.layer2 = nn.Linear(50 * 30 * 40, 70 * 10 * 15)
        self.layer3 = nn.Linear(70 * 10 * 15, 90 * 5 * 8)
        self.layer4 = nn.Linear(90 * 5 * 8, 80)
        self.layer5 = nn.Linear(80, num_classes)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = F.relu(self.layer3(x))
        x = F.relu(self.layer4(x))
        out = self.layer5(x)
        return out

model = my_network(num_classes=52)

I was also thinking that, since you’re training on images, it would probably be beneficial from a performance standpoint to use a convolutional neural network rather than a fully connected one (but that’s perhaps something you can do after you’ve identified the error).
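
For example, something in this direction (only a rough sketch; the channel counts are arbitrary, and the 32 * 15 * 20 assumes two 2x2 poolings of the 1 x 60 x 80 input):

class my_conv_network(nn.Module):  # hypothetical name, not from the code above
    def __init__(self, num_classes):
        super(my_conv_network, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(32 * 15 * 20, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 1x60x80 -> 16x30x40
        x = self.pool(F.relu(self.conv2(x)))  # 16x30x40 -> 32x15x20
        x = x.view(x.size(0), -1)
        return self.fc(x)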

Hi, why would unnormalized inputs cause the model to diverge?


Hello, I once had the same problem: I was getting NaNs all the time. Eventually I found out that my data contained NaNs. Run the code below to check whether your outputs contain NaNs:

outputs = model(images)
# NaN is the only value that is not equal to itself, so this counts NaN entries
check = int((outputs != outputs).sum())
if check > 0:
    print("your data contains NaN")
else:
    print("your data does not contain NaN, it might be another problem")

Similarly, you can replace outputs with the loss or the labels in the code I sent to check where the NaN is coming from.
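
For example (a sketch to drop into the training loop; the .float() cast is there because the labels for CrossEntropyLoss are integer-typed):

# Sketch: check each stage to narrow down where the NaN first appears.
assert not torch.isnan(images).any(), "NaN in the input images"
assert not torch.isnan(labels.float()).any(), "NaN in the labels"
outputs = model(images)
assert not torch.isnan(outputs).any(), "NaN in the model outputs"
loss = criterion(outputs, labels)
assert not torch.isnan(loss), "NaN in the loss"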

I restarted my system and it started training. I think even the GPU can mess up sometimes.

If you take a log anywhere yourself (e.g. torch.log on probabilities, or NLLLoss on hand-computed probabilities), ensure its inputs are never 0 or negative, since log fails there. Note that nn.CrossEntropyLoss takes raw logits and applies log-softmax internally, so zero or negative model outputs are fine with it.