PyTorch CNN Returns Only One Result After Training

I’m training a CNN image classifier. The network classifies 255 x 255 RGB images (resized to 256 x 256 in the data loader) into five categories numbered 0 to 4.

But the network behaves strangely during training. Although the loss drops smoothly, the model returns the same answer for every sample in a batch most of the time. Stranger still, it eventually settles on answering only 2.

Here’s typical training output with batches of 10 images:

LABELS                                 OUTPUT                                 CORRECT
tensor([2, 0, 2, 2, 2, 0, 2, 2, 2, 4]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([2, 2, 2, 2, 3, 4, 1, 2, 2, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 0 / 10
tensor([2, 2, 2, 0, 2, 4, 3, 1, 2, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 1 / 10
tensor([3, 4, 2, 2, 0, 4, 4, 3, 2, 0]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([1, 2, 2, 4, 2, 0, 1, 0, 0, 0]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 4 / 10
tensor([2, 2, 2, 3, 2, 0, 0, 1, 2, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([1, 1, 0, 1, 2, 2, 1, 1, 0, 1]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([0, 2, 1, 3, 3, 2, 1, 0, 2, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([2, 3, 2, 2, 3, 1, 0, 1, 0, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 2 / 10
tensor([3, 2, 3, 1, 1, 2, 0, 4, 2, 2]) tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 1 / 10
tensor([2, 1, 0, 3, 1, 2, 2, 1, 2, 0]) tensor([2, 2, 2, 2, 2, 0, 2, 2, 0, 2]) 2 / 10
tensor([3, 0, 2, 1, 3, 1, 2, 4, 2, 2]) tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 4 / 10
tensor([2, 2, 1, 2, 1, 1, 1, 4, 3, 2]) tensor([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]) 4 / 10

# Remaining predictions are always [2, 2, 2...]
# Loss function is not shown, but it declines smoothly and looks well behaved

Although 2 is the most common category in the labels (about 50% of the images), I don’t see why the CNN should ‘concentrate’ on a single answer (0 in the sample above) or end up always predicting 2.

I expected more varied output tensors even if the accuracy was poor. What am I doing wrong?

Here’s my code for the network…

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.transforms import v2

class CNN(nn.Module):
    def __init__(self, n_layers=3, n_categories=5):  # n_layers = input channels (3 for RGB)
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(n_layers, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.conv3 = nn.Conv2d(16, 16, 5)
        self.fc1 = nn.Linear(16 * 28 * 28, 200)
        self.fc2 = nn.Linear(200, 84)
        self.fc3 = nn.Linear(84, n_categories)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 256 -> 252 (conv) -> 126 (pool)
        x = self.pool(F.relu(self.conv2(x)))  # 126 -> 122 -> 61
        x = self.pool(F.relu(self.conv3(x)))  # 61 -> 57 -> 28
        x = x.view(-1, 16 * 28 * 28)          # flatten: 16 * 28 * 28 = 12544 features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)                       # raw logits; CrossEntropyLoss applies log-softmax
        return x

…the optimizer, loss function and dataloader…

model = CNN()

transforms = v2.Compose([
    v2.ToImageTensor(),                     # PIL image / ndarray -> tensor
    v2.ConvertImageDtype(),                 # uint8 [0, 255] -> float32 [0, 1]
    v2.Resize((256, 256), antialias=True)
])

dataset = UBCDataset(transforms=transforms)
full_dataloader = DataLoader(dataset, batch_size=10, shuffle=False)  # batches stay in dataset order

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

…and the training loop that produced the above output.

batches = iter(full_dataloader)

print("LABELS                                 OUTPUT                                 CORRECT")
for X, y in batches:   
    model.train()    
    
    pred = model(X)
    loss = loss_fn(pred, y)
    
    loss.backward()
    optimizer.step()
    #optimizer.zero_grad()
    
    print(f"{y} {pred.argmax(1)} {int(sum(y == pred.argmax(1)))} / {len(y)} {loss.item()}")

Even more puzzling, the output from the model (the pred variable in the training loop) always looks something like this:

tensor([[-0.2310,  0.1805,  0.7584, -0.7285, -0.7594],
        [-0.2310,  0.1806,  0.7585, -0.7286, -0.7592],
        [-0.2313,  0.1806,  0.7586, -0.7286, -0.7593]],
       grad_fn=<AddmmBackward0>)

Any input is appreciated.

A few things come to mind:

  1. Looking at your model architecture, I suspect the model is under-fitting. You may want to add more Conv2d layers. Also, nn.Linear(16 * 28 * 28, 200) has a lot of parameters to learn; you could downsample the images to a much smaller spatial size before flattening them and passing them through a couple of fully connected layers (see the first sketch below the list).

  2. As you mentioned, class 2 constitutes about 50% of the dataset, so your dataset is fairly imbalanced. You could add data augmentation to balance it out, or weight the loss by class frequency (see the second sketch below).
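
For the first point, a rough sketch of the kind of architecture I mean, assuming 256 x 256 inputs. It is not tuned, just one way to shrink the feature map before the linear layers:

class DeeperCNN(nn.Module):
    def __init__(self, in_channels=3, n_categories=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 256 -> 128
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),           # 128 -> 64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),           # 64 -> 32
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),           # 32 -> 16
            nn.AdaptiveAvgPool2d(4),                                               # 16 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                        # 64 * 4 * 4 = 1024 features (vs. your 12544)
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, n_categories),
        )

    def forward(self, x):
        return self.classifier(self.features(x))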

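For the second point, a minimal sketch of the class-weight route. The label counts below are made up for illustration; compute the real ones from your dataset:

# Hypothetical per-class label counts for the five categories.
counts = torch.tensor([120., 80., 500., 150., 150.])
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights

# CrossEntropyLoss accepts a per-class weight tensor of size n_categories.
loss_fn = nn.CrossEntropyLoss(weight=weights)
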
Edit:

Why did you comment out optimizer.zero_grad()? If you don’t zero the gradients before the next backward pass, the new gradients are added to the existing ones, so each optimizer.step() applies an ever-growing running sum of all past gradients. This might be the problem.
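
For reference, the loop with the gradients cleared on every iteration, using the same names as your code:

model.train()
for X, y in full_dataloader:
    optimizer.zero_grad()   # reset gradients accumulated from the previous batch
    pred = model(X)
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()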

Thank you for your input. At this point I was testing the loader and the training loop before actually getting down to polishing the model, so I fully expected underfitting. But such a glaringly defective result baffled me. I’m looking for some insight into how it came about, mostly out of curiosity, because I did figure out what went wrong.

It turns out my dataloader had major problems (the normalization was way out of whack), and the loss function wasn’t weighted to account for the class imbalance.
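
For the record, the transform-side fix looked roughly like this. The mean/std values below are the common ImageNet statistics and are only placeholders, not my actual numbers; the proper values should be computed from your own dataset:

transforms = v2.Compose([
    v2.ToImageTensor(),
    v2.ConvertImageDtype(),                    # uint8 [0, 255] -> float32 [0, 1]
    v2.Resize((256, 256), antialias=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406],   # placeholder per-channel stats;
                 std=[0.229, 0.224, 0.225]),   # compute these from your own data
])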

Now that I’ve ironed out these wrinkles, the results look plausible, though the model does underfit. :grin:

And you’re right about the zero_grad. I’ve put it back in as well.

Thanks again,
Paulo