Poor performance on the MNIST digits problem (using MSELoss and SGD)

I want to treat the MNIST digits as a regression problem (the way we do for house price prediction).

I used MSELoss and the SGD optimizer. The last layer of the CNN model is linear with one neuron. The structure of the model is given below:

import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),    # 28x28 -> 14x14
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(2),                # 14x14 -> 7x7
        )
        self.out = nn.Linear(32 * 7 * 7, 1)  # single output neuron for regression

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)            # flatten to [batch, 32*7*7]
        output = self.out(x)
        return output, x                     # also return the flattened features

Training code snippet:

optimizer = optim.SGD(params=model.parameters(), lr=LR)
criterion = nn.MSELoss()

def train(NB_EPOCS, model, loaders):
    model.train()
    total_step = len(loaders['train'])
    for epoch in range(NB_EPOCS):
        for i, (images, labels) in enumerate(loaders['train']):
            b_x = Variable(images)   # batch x
            b_y = Variable(labels)   # batch y
            output, _ = model(b_x)  
            loss = criterion(output, b_y.float())
            optimizer.zero_grad()          
            loss.backward()    
            optimizer.step()            
            if (i+1) % 100 == 0:
                print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                       .format(epoch + 1, NB_EPOCS, i + 1, total_step, loss.item()))

I am getting a warning

UserWarning: Using a target size (torch.Size([100])) that is different to the input size (torch.Size([100, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)

The loss is very high

Epoch [1/10], Step [100/600], Loss: 7.7500
Epoch [1/10], Step [200/600], Loss: 8.3003
Epoch [1/10], Step [300/600], Loss: 8.7280
Epoch [1/10], Step [400/600], Loss: 8.4920
Epoch [1/10], Step [500/600], Loss: 8.5399
Epoch [1/10], Step [600/600], Loss: 8.8300
Epoch [2/10], Step [100/600], Loss: 10.8930
Epoch [2/10], Step [200/600], Loss: 10.0020
Epoch [2/10], Step [300/600], Loss: 7.9896
Epoch [2/10], Step [400/600], Loss: 7.2748
Epoch [2/10], Step [500/600], Loss: 9.4017

However, at the time of testing, the model is predicting all images as 0.

Test Accuracy of the model on the 10000 test images: %.2f 0.16
Prediction Number: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual Number [3 8 1 8 3 0 3 9 0 9 3 3 3 1 2 9 7 4 9 1 4 7 4 7 4 3 7 3 1 4 1 7 4 1 6 2 5
 0 0 6 8 8 3 2 5 1 6 3 9 3 8 1 4 1 7 7 5 8 3 2 0 4 3 5 9 3 9 4 7 6 0 7 2 3
 9 2 6 7 3 5 6 8 2 3 7 2 6 5 6 3 6 4 5 0 0 7 0 7 6 6]

I think the warning is one of the reasons for the poor performance. Could you tell me how I can resolve the warning and improve the performance?

Hi,
you need to squeeze your input tensor so that input and target are of the same size.
Just like the warning says: broadcasting ruins your loss computation.
In your specific case, PyTorch would broadcast both tensors to size ([100, 100]).
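
For illustration, here is a minimal sketch (with random tensors instead of your data) of what happens to those shapes:

import torch
import torch.nn as nn

criterion = nn.MSELoss()
output = torch.randn(100, 1)   # model output, shape [100, 1]
target = torch.randn(100)      # labels, shape [100]

# [100, 1] and [100] broadcast to [100, 100]: every output is compared
# against every target, so the resulting loss value is meaningless.
loss = criterion(output, target)  # triggers the UserWarning above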

@Unity05 could you tell me how I should do the squeeze?

loss = criterion(output.squeeze(-1), b_y.float())
Btw, Variables are deprecated. Nowadays, normal tensors can have gradients as well.
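
For example, the training step from above can be written without the Variable wrappers (a minimal sketch of the same loop):

for images, labels in loaders['train']:
    output, _ = model(images)                         # plain tensors track gradients
    loss = criterion(output.squeeze(-1), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()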


@Unity05 Thanks a lot, it works! Now the loss is not terrible and the training phase is not bad.

Training phase prediction

[ 7.6779],
        [ 4.2894],
        [ 4.9621],
        [ 3.0036],
        [ 1.2166],
        [ 4.3545],
        [ 6.1130],
        [ 8.3136],
        [ 6.8330],
        [ 2.3176],
        [ 4.7833],
        [-0.7525],
        [ 3.9115],
        [ 3.6507]]

Training phase actual

[9, 6, 6, 2, 1, 2, 8, 9, 7, 3, 6, 0, 5, 5]

But at test time I am still getting all 0s instead of the actual numbers!

Test Accuracy of the model on the 10000 test images: %.2f 0.05
Prediction Number: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual Number [9 0 5 4 9 4 9 9 6 4 8 6 3 3 7 8 8 9 6 6 7 2 8 1 6 6 7 9 2 0 1 8 9 4 3 2 6
 6 0 1 5 2 7 2 8 2 0 5 7 4 6 2 4 2 1 2 8 2 2 9 6 8 1 7 3 4 1 4 2 3 1 6 4 8
 7 0 2 5 5 1 4 4 5 1 1 3 7 1 9 0 7 4 2 0 7 0 2 8 5 3]

Any idea why this is happening? Or am I doing something wrong at test time?

#Loading Model
model = my_model.get_model()
model.load_state_dict(torch.load(os.path.join(CHECKPOINT, "baseline.h5")))

#loading test data
loaders = preprocess.data_loaders()

def test():
    # Test the model
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in loaders['test']:
            test_output, last_layer = model(images)
            pred_y = torch.max(test_output, 1)[1].data.squeeze()
            accuracy = (pred_y == labels).sum().item() / float(labels.size(0))
            
    print('Test Accuracy of the model on the 10000 test images: %.2f', accuracy)
    
    
test()

These are the actual predictions for the test images:

        [ 7.0612],
        [ 4.3061],
        [ 6.4712],
        [ 0.7163],
        [ 3.1995],
        [ 2.7388],
        [ 3.8326],
        [ 3.6017],
        [ 7.4603],
        [ 0.7398],
        [ 8.2723],
        [ 1.7648]], grad_fn=<AddmmBackward>)

I think I am doing something wrong in this line:

pred_y = torch.max(test_output, 1)[1].data.squeeze()

That happens because the second output of torch.max() contains the respective indices.
Furthermore, you set dim=1, but your outputs only have one element along dim 1, so the argmax is always 0. I guess you want to change your network to have 10 output nodes to regress the probabilities for each digit?
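
A quick sketch of why the argmax is always 0 for a single output node:

import torch

test_output = torch.randn(5, 1)                 # one regression value per sample
values, indices = torch.max(test_output, dim=1)
print(indices)                                  # tensor([0, 0, 0, 0, 0]) -- dim 1 has only one entry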

No, I need only one value. I used the line below and it is working fine.

pred_y = test_output.squeeze()
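
(For an accuracy number, the squeezed predictions still have to be rounded before comparing them with the integer labels; a minimal sketch of one way to do that inside the test loop:)

pred_y = test_output.squeeze(-1).round().clamp(0, 9).long()   # nearest digit in [0, 9]
accuracy = (pred_y == labels).sum().item() / float(labels.size(0))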

Okay then. :slight_smile:
May I ask why you don’t train a classifier for MNIST? :thinking:
And btw, it’s more expressive to use an AverageMeter for the loss logs instead of only one batch’s loss. :wink:

@Unity05 Out of curiosity: I didn’t quite get the “AverageMeter for the loss logs” suggestion, could you tell me a little more?


That’s just taking the average over all batches in that interval.
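
A minimal sketch of such an AverageMeter (the name follows the common PyTorch example scripts; it is not a built-in class):

class AverageMeter:
    """Keeps a running sum and count and reports the average."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)

In the training loop you would call loss_meter.update(loss.item(), images.size(0)) every step and log loss_meter.avg (resetting the meter each logging interval) instead of a single batch’s loss.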

Hi @akib62, for good performance on MNIST use this model with nn.NLLLoss().
You will get over 99% accuracy with a learning rate of 0.0001, the Adam optimizer, and 30 epochs.
I got 99.56% with this model.

Model

from torch import nn
import torch

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)    # 28x28 -> 14x14
        self.activation1 = nn.ReLU()

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)   # 14x14 -> 7x7
        self.activation2 = nn.ReLU()

        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)   # 7x7 -> 7x7
        self.activation3 = nn.ReLU()
        self.conv4 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=0)  # 7x7 -> 5x5
        self.activation4 = nn.ReLU()

        self.linear1 = nn.Linear(128 * 5 * 5, 10)   # 10 class scores
        self.soft = nn.LogSoftmax(dim=1)            # log-probabilities for nn.NLLLoss

    def forward(self, xb):
        xb = self.conv1(xb)
        xb = self.activation1(xb)

        xb = self.conv2(xb)
        xb = self.activation2(xb)

        xb = self.conv3(xb)
        xb = self.activation3(xb)
        xb = self.conv4(xb)
        xb = self.activation4(xb)

        xb = xb.reshape(-1, 128 * 5 * 5)            # flatten to [N, 128*5*5]

        xb = self.linear1(xb)
        xb = self.soft(xb)

        return xb
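
A minimal training-step sketch with this model, nn.NLLLoss(), and Adam (the train_loader name is a placeholder, not the exact code from the repo):

import torch
from torch import nn, optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Model().to(device)
criterion = nn.NLLLoss()                        # expects log-probabilities and class indices
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(30):
    model.train()
    for images, labels in train_loader:         # assumed torchvision MNIST DataLoader
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        log_probs = model(images)               # LogSoftmax output, shape [N, 10]
        loss = criterion(log_probs, labels)
        loss.backward()
        optimizer.step()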

If you are interested, the full training loop and accuracy test are on my GitHub; you can try it yourself.
