Neural Network not training and giving same output for every epoch

Hi,

I am training a model on the CIFAR10 dataset and, for some reason, the weights don’t seem to change at all:

Epoch 0/24
----------
100%|██████████| 3125/3125 [00:11<00:00, 280.29it/s]
train Loss: 4.0429 Acc: 0.1667
100%|██████████| 3125/3125 [00:07<00:00, 413.33it/s]
val Loss: 4.0429 Acc: 0.1667

Epoch 1/24
----------
100%|██████████| 3125/3125 [00:10<00:00, 286.69it/s]
train Loss: 4.0429 Acc: 0.1667
100%|██████████| 3125/3125 [00:07<00:00, 411.33it/s]
val Loss: 4.0429 Acc: 0.1667

Epoch 2/24
----------
100%|██████████| 3125/3125 [00:11<00:00, 281.77it/s]
train Loss: 4.0429 Acc: 0.1667
100%|██████████| 3125/3125 [00:07<00:00, 414.74it/s]
val Loss: 4.0429 Acc: 0.1667

Epoch 3/24
----------
100%|██████████| 3125/3125 [00:11<00:00, 279.62it/s]
train Loss: 4.0429 Acc: 0.1667

This is how my model is training:

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    # since = time.time()
    # best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)
        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode
            running_loss = 0.0
            running_corrects = 0
            # Iterate over data.
            for inputs, labels in tqdm(training_loader):
                inputs = inputs.cuda()
                labels = labels.cuda()
                # zero the parameter gradients
                optimizer.zero_grad()
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
            if phase == 'train':
                scheduler.step()
            epoch_loss = running_loss / 30000
            epoch_acc = running_corrects.double() / 30000
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))
        print()
    return model 

I have tried different learning rates and different optimisers, but nothing seems to work. What am I doing wrong?

This does seem unexpected. Could you please post an executable code snippet that reproduces the issue?

Not sure what the reason is, but take a look at a couple of inconsistencies:

  • No input data or dataset or dataloader passed to train_model, but there is training_loader
  • You are validating the model on the training data, whatever it is
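
One way to address both points, as a minimal sketch: pass the loaders into the loop explicitly, keyed by phase, so that each phase iterates over its own split. Plain Python lists stand in for the DataLoader objects here, and the names `dataloaders` and `samples_per_phase` are hypothetical, not taken from the original code.

```python
def samples_per_phase(dataloaders):
    """Iterate each phase over its own loader and count the samples seen."""
    seen = {}
    for phase in ("train", "val"):
        count = 0
        # The loader is selected by phase instead of being a fixed global.
        for inputs, labels in dataloaders[phase]:
            count += len(labels)
        seen[phase] = count
    return seen


# Toy batches of (inputs, labels) standing in for real DataLoaders.
dataloaders = {
    "train": [([0.1, 0.2], [0, 1]), ([0.3], [2])],
    "val": [([0.4], [1])],
}
counts = samples_per_phase(dataloaders)
print(counts)
```

In the real loop the dict would presumably hold the training and validation DataLoader objects, and the per-phase sample count could also serve as the denominator for the epoch metrics.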

I haven’t added the code here but the data is loading properly, I checked.

Thanks. I will add some code.

@ptrblck, would it be fine if I sent you a Colab notebook? Or do you recommend something else?

A Colab notebook might work, but you could also post the code directly here.

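import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from tqdm import tqdm

conv1 = nn.Conv2d(3,8,kernel_size=(3,4), padding='same')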
conv2 = nn.Conv2d(8, 32, kernel_size=2, padding='same')
conv3 = nn.Conv2d(32,64,kernel_size=3, padding='same')
conv4 = nn.Conv2d(64, 128, kernel_size=3, padding='same')
conv5 = nn.Conv2d(128,256, kernel_size=3, padding='same')
conv6 = nn.Conv2d(256,256, kernel_size=(3,2), padding='same')

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = conv1
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = conv2
        self.conv3 = conv3
        self.conv4 = conv4
        self.conv5 = conv5
        self.conv6 = conv6
        self.fc1 = nn.Linear(256, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):


        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = self.pool(F.relu(self.conv5(x)))
        x = x.view(-1, 256)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
cifar_model = Net()

training_data = datasets.CIFAR10(root="data", train=True, download=True,
                                  transform=transforms.Compose([
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,0.5,0.5), (1.0,1.0,1.0))
                                  ]))

validation_data = datasets.CIFAR10(root="data", train=False, download=True,
                                  transform=transforms.Compose([
                                      transforms.ToTensor(),
                                      transforms.Normalize((0.5,0.5,0.5), (1.0,1.0,1.0))
                                  ]))

training_loader = DataLoader(training_data, 
                             batch_size=16, 
                             shuffle=True,
                             pin_memory=True)

validation_loader = DataLoader(validation_data,
                               batch_size=16,
                               shuffle=True,
                               pin_memory=True)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(cifar_model.parameters(), lr=1, momentum=0.9)
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

model = train_model(cifar_model, criterion, optimizer, exp_lr_scheduler,
                       num_epochs=25)

The train_model function is exactly as above.

Thanks for the code.
Could you remove the sigmoid as the last non-linearity, as nn.CrossEntropyLoss expects raw logits, and rerun the code?

EDIT: Also note that you are passing vgg_comp_model.parameters() to the optimizer, which is undefined in your code.
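
To illustrate why the sigmoid matters here, a small pure-Python sketch (not PyTorch code) of softmax cross-entropy for a single sample with ten classes, as in CIFAR10. The helper name `cross_entropy` and the example values are assumptions made for illustration. If the inputs to the loss are squashed into [0, 1], even a perfect prediction cannot push the loss below roughly 1.46, whereas raw logits can drive it close to zero.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Best case after a sigmoid: the target output is 1 and the other nine are 0.
best_after_sigmoid = cross_entropy([1.0] + [0.0] * 9, target=0)

# Raw logits are unbounded, so a confident correct prediction
# can push the loss toward 0.
confident_logits = cross_entropy([20.0] + [0.0] * 9, target=0)

print(best_after_sigmoid, confident_logits)
```

The lower bound comes from the arithmetic log(e + 9) - 1, so with a sigmoid in front of the loss the optimizer has very little room to reduce it.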

I have tried this. It didn’t work. I remember you’ve recommended this before, so I’ve already tried it.

Apologies for this. Kindly change the name to cifar_model as defined.

After fixing the code, you can see that the model is changing as the loss and accuracy are moving.

Epoch 0/24
----------
train Loss: 4.0923 Acc: 0.1649
val Loss: 4.2621 Acc: 0.1667

Epoch 1/24
----------
train Loss: 4.0785 Acc: 0.1646
val Loss: 4.1317 Acc: 0.1667

Epoch 2/24
----------
train Loss: 4.0882 Acc: 0.1650
val Loss: 4.1552 Acc: 0.1667

Epoch 3/24
----------
train Loss: 4.0823 Acc: 0.1675
val Loss: 4.3270 Acc: 0.1667

Epoch 4/24
----------
train Loss: 4.0952 Acc: 0.1669
val Loss: 4.0157 Acc: 0.1667

However, the loss is not decreasing, most likely because of the high learning rate.
Using a standard Adam optimizer setup, such as:

optimizer = torch.optim.Adam(cifar_model.parameters(), lr=1e-3)

yields better results right from the start:

Epoch 0/24
----------
train Loss: 2.6817 Acc: 0.6387
val Loss: 2.1259 Acc: 0.8903

Epoch 1/24
----------
train Loss: 1.9875 Acc: 0.9448
val Loss: 1.7324 Acc: 1.0461

Epoch 2/24
----------
train Loss: 1.7088 Acc: 1.0580
val Loss: 1.5539 Acc: 1.1126

Epoch 3/24
----------
train Loss: 1.5190 Acc: 1.1288
val Loss: 1.3736 Acc: 1.1824

Epoch 4/24
----------
train Loss: 1.3708 Acc: 1.1851
val Loss: 1.1912 Acc: 1.2527
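
On the learning rate point above, a toy sketch of how a step size that is too large can keep the loss from decreasing. It runs plain gradient descent (no momentum) on f(w) = w**2, which is an assumed stand-in and not the CIFAR10 model; the function name `gradient_descent` is hypothetical.

```python
def gradient_descent(w, lr, steps):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w


# lr=1.0: each step maps w to -w, so the loss w**2 never changes.
w_large_lr = gradient_descent(1.0, lr=1.0, steps=10)

# lr=0.1: each step shrinks w by a factor of 0.8, so the loss decays.
w_small_lr = gradient_descent(1.0, lr=0.1, steps=50)

print(w_large_lr ** 2, w_small_lr ** 2)
```

A real network is far less regular than this, but the same overshooting effect is one plausible reason why lr=1 with SGD stalls while a much smaller step makes progress.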

However, as you can see, your accuracy calculation is wrong: you are dividing the number of correct predictions by 30000 instead of the number of samples in the dataset (50000 for the training split and 10000 for the validation split) in:

epoch_acc = running_corrects.double() / 30000

Also, you are only using the training_loader, while the printed output assumes that a validation loss is also calculated.
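
A minimal sketch of one way to avoid a hard-coded denominator: count the samples actually seen during the epoch and divide by that count. Plain Python values stand in for tensors, and the name `epoch_metrics` and the batch layout are assumptions for illustration.

```python
def epoch_metrics(batches):
    """Aggregate per-batch results into an epoch loss and accuracy.

    `batches` yields (mean_batch_loss, predictions, labels) tuples.
    """
    total_loss = 0.0
    correct = 0
    seen = 0
    for mean_loss, preds, labels in batches:
        n = len(labels)
        total_loss += mean_loss * n  # undo the per-batch mean
        correct += sum(p == y for p, y in zip(preds, labels))
        seen += n
    # Divide by the number of samples seen, not by a fixed constant.
    return total_loss / seen, correct / seen


batches = [
    (2.0, [1, 2], [1, 0]),  # one of two predictions is correct
    (1.0, [3], [3]),        # the single prediction is correct
]
epoch_loss, epoch_acc = epoch_metrics(batches)
print(epoch_loss, epoch_acc)
```

With this approach the accuracy cannot exceed 1.0, regardless of which split or batch size is used.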

Thanks for pointing out the mistake. I think I reused my previous code for this.

I am experimenting with something which is why I have pre-defined the weights for my model. It worked for the smaller models which is why I am trying bigger models now.

So I have pre-assigned weights to my model and then I am training it. However, even with the Adam optimiser you suggested, it’s still not learning.

Epoch 0/24
----------
100%|██████████| 3125/3125 [00:12<00:00, 250.17it/s]
train Loss: 2.3923 Acc: 0.5000
100%|██████████| 3125/3125 [00:07<00:00, 402.19it/s]
val Loss: 2.3923 Acc: 0.5000

Epoch 1/24
----------
100%|██████████| 3125/3125 [00:12<00:00, 254.30it/s]
train Loss: 2.3923 Acc: 0.5000
100%|██████████| 3125/3125 [00:07<00:00, 413.95it/s]
val Loss: 2.3923 Acc: 0.5000

Epoch 2/24
----------
100%|██████████| 3125/3125 [00:12<00:00, 258.83it/s]
train Loss: 2.3923 Acc: 0.5000
100%|██████████| 3125/3125 [00:07<00:00, 417.37it/s]
val Loss: 2.3923 Acc: 0.5000

What else do you suggest?

Validating on the same data you train on pretty much only tests whether the dropout and batchnorm layers are working OK.

As for your case, you can track your loss with more than 4 digits after the decimal point (and per batch, with shuffle=False). If it doesn’t change at all, then something is wrong with either the loss value or the gradients, since the weights are not being adjusted at all. IDK, ‘nan’ in the gradients? Or maybe, just maybe, you’re overwriting the weights each epoch?
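
A rough sketch of those two checks on plain lists of floats. The helper names are hypothetical; in PyTorch the flat lists could presumably be obtained from a parameter or its `.grad` with something like `.detach().flatten().tolist()`, taken once before and once after `optimizer.step()`.

```python
import math


def max_abs_change(before, after):
    """Largest absolute difference between two flat lists of parameter values."""
    return max(abs(a - b) for a, b in zip(after, before))


def has_nan(values):
    """True if any value in a flat list is NaN."""
    return any(math.isnan(v) for v in values)


weights_before = [0.5, -1.25, 2.0]
weights_after = [0.5, -1.25, 2.0]   # identical: the step changed nothing
grads = [0.0, float("nan"), 0.0]    # made-up gradient values with one NaN

print(max_abs_change(weights_before, weights_after), has_nan(grads))
```

A change of exactly 0.0 after a step would point at zero gradients or at the weights being overwritten, while a NaN would point at a numerical problem upstream.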

The code itself is working with proper hyperparameters as can be seen by my output, so I would guess that your overall training is failing due to the weight init and the high learning rate.

@ptrblck, So I checked and these are the values I get for output, loss and gradient for the last layer:

tensor([[0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 1., 0., 0., 1., 0.]], device='cuda:0',
       grad_fn=<SigmoidBackward>)

It’s giving the same output to every image for some reason. Maybe it’s fine because it hasn’t learned anything.

The loss is this:
tensor(2.3433, device='cuda:0', grad_fn=<NllLossBackward>)

But when I print the gradient in the last layer, I get 0:

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

The weights in the final layer are not pre-defined. So there should be some gradient in the final layer at least.

You are still using the sigmoid, which is wrong: nn.CrossEntropyLoss expects raw logits, and the sigmoid will also saturate your outputs and gradients.
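
A small pure-Python sketch of the saturation effect. The derivative of the sigmoid is s * (1 - s); once the output rounds to exactly 1.0 in floating point, that derivative is exactly 0.0 and nothing flows back through the layer. The pre-activation value of 40 is an assumed number chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(z)
    return s * (1.0 - s)


grad_at_zero = sigmoid_grad(0.0)     # the largest possible value of the derivative
grad_saturated = sigmoid_grad(40.0)  # output rounds to 1.0, derivative vanishes

print(sigmoid(40.0), grad_at_zero, grad_saturated)
```

Outputs that are printed as exactly 0. or 1. are a strong hint that the sigmoid is operating in this saturated regime.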

So I have removed sigmoid from the final layer. These are the values of outputs, loss, gradient:

tensor([[42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,
         39.2434, 42.2288],
        [42.0968, 36.8111, 41.2566, 41.8263, 42.0318, 43.3060, 40.5994, 41.5220,

LOSS:

tensor(3.8067, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(3.4348, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.8786, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.9722, device='cuda:0', grad_fn=<NllLossBackward>)

Gradient:

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,

An update. Instead of sigmoid, I used nn.LogSoftmax. Sometimes it trains and sometimes the outputs are nan. But I get this warning:

UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.

What are your thoughts on this?

LogSoftmax is also not expected, as described before: nn.CrossEntropyLoss applies log-softmax internally, so remove any activation function applied to the output of the model. As you can see from my outputs, the model trains fine with these fixes.
If you get stuck, please post a minimal executable code snippet, which would reproduce the issue.
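
For reference, a pure-Python sketch of what the loss does with raw logits: nn.CrossEntropyLoss is documented as being equivalent to a log-softmax followed by a negative log-likelihood, so the normalization into probabilities already happens inside the loss. The helper names and the example logits below are assumptions for illustration, not PyTorch's implementation.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy on raw logits: log-softmax followed by NLL."""
    return nll(log_softmax(logits), target)


logits = [2.0, -1.0, 0.5]
loss = cross_entropy(logits, target=0)

# Probabilities are only needed for reporting; the predicted class
# is simply the index of the largest logit.
probs = [math.exp(lp) for lp in log_softmax(logits)]
predicted = max(range(len(logits)), key=lambda i: logits[i])

print(loss, probs, predicted)
```

So the model can return unnormalized scores during training, and a softmax can be applied afterwards if actual probabilities are needed.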

So after removing the activation layer, it is training as usual. Doesn’t make sense though, why can’t we have an activation layer? Should we normalize the output for classification?