RNN MNIST: test loss and accuracy better than train loss and accuracy

Dear Community,

My vanilla RNN reports a lower test loss and a higher test accuracy than the corresponding training metrics, as shown in the output below:

Simple RNN initalised with 1 layers and 6 number of hidden neurons.
Epoch:1  Train[Loss:2.29849  Top1 Acc:0.15428]
Epoch:1  Test[Loss:2.29271    Top1 Acc:0.243]
Epoch:2  Train[Loss:2.28381  Top1 Acc:0.27862]
Epoch:2  Test[Loss:2.27542    Top1 Acc:0.3129]
...
Epoch:7  Train[Loss:2.15966  Top1 Acc:0.47402]
Epoch:7  Test[Loss:2.14895    Top1 Acc:0.4761]
Epoch:8  Train[Loss:2.13435  Top1 Acc:0.4836]
Epoch:8  Test[Loss:2.1245    Top1 Acc:0.4859]

I have tried investigating this by:

  1. Removing all regularisation (i.e., setting weight decay to zero; no dropout or similar was used to begin with).
  2. Checking the training loop and how accuracy is computed within it. Here I corrected a small bias by computing accuracy over the entire dataset rather than per batch.
  3. Removing the custom weight initialisation, and replacing warmup with cosine annealing by a constant learning rate of 0.00001.
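The bias mentioned in point 2 can be sketched without any framework. This is only an illustration: `aggregate_epoch` is a hypothetical helper and the numbers in `batches` are made up, not taken from the runs in this post. It compares the plain mean of per-batch losses with a sample-weighted mean when the last batch is smaller than the others.

```python
# Hypothetical helper: combine per-batch statistics into epoch-level metrics.
# Each entry is (mean loss of the batch, number of correct predictions, batch size).

def aggregate_epoch(batches):
    total = sum(n for _, _, n in batches)
    # weight each batch's mean loss by its size, so a short last batch
    # does not count as much as a full one
    loss = sum(mean_loss * n for mean_loss, _, n in batches) / total
    accuracy = sum(correct for _, correct, _ in batches) / total
    return loss, accuracy


# Made-up numbers: two full batches of 64 and a short final batch of 16.
batches = [(0.5, 60, 64), (0.5, 60, 64), (2.0, 4, 16)]

naive_loss = sum(mean_loss for mean_loss, _, _ in batches) / len(batches)
weighted_loss, accuracy = aggregate_epoch(batches)

print(f"mean of batch means: {naive_loss:.4f}")
print(f"sample-weighted:     {weighted_loss:.4f}")
print(f"accuracy:            {accuracy:.4f}")
```

Dividing the number of correct predictions by the number of samples actually seen keeps the accuracy exact even when the last batch is smaller or dropped.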

However, despite these changes, test loss and accuracy still outperform training loss and accuracy by a small margin, regardless of which configuration I use. I have added my RnnCell, my SimpleRNN and, maybe more importantly, my training loop below. Any thoughts or ideas on what could cause this behavior would be appreciated. Could these small differences be negligible?

The code should also run when pasted into a Google Colab or notebook cell. Please let me know.

Imports

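import numpy as np
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
from torch import nn, optim
from torch.utils.data import DataLoader
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')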

RnnCell

class RnnCell(nn.Module):
    def __init__(self, input_size, hidden_size, activation="tanh"):
        super(RnnCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.activation = activation
        if self.activation not in ["tanh", "relu", "sigmoid"]:
            raise ValueError("Invalid nonlinearity selected for RNN. Please use tanh, relu or sigmoid.")

        self.input2hidden = nn.Linear(input_size, hidden_size)
        # recurrent connection: maps the previous hidden state to the current one
        self.hidden2hidden = nn.Linear(hidden_size, hidden_size)
        
        self.init_weights_normal()
        
    def forward(self, input, hidden_state = None):
        '''
        Inputs: input (torch tensor) of shape [batchsize, input_size]
                hidden state (torch tensor) of shape [batchsize, hiddensize]
        Output: output (torch tensor) of shape [batchsize, hiddensize ]
        '''

        # initialise the hidden state with zeros on the first step
        if hidden_state is None:
            hidden_state = torch.zeros(input.shape[0], self.hidden_size, device=input.device)

        hidden_state = (self.input2hidden(input) + self.hidden2hidden(hidden_state))

        # takes output from hidden and apply activation
        if self.activation == "tanh":
            out = torch.tanh(hidden_state)
        elif self.activation == "relu":
            out = torch.relu(hidden_state)
        elif self.activation == "sigmoid":
            out = torch.sigmoid(hidden_state) 
        return out

    def init_weights_normal(self):
        # initialise every parameter (weights and biases) from a normal
        # distribution centred at 0 with standard deviation 0.02
        for weight in self.parameters():
            weight.data.normal_(0, 0.02)

Simple RNN

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, activation='relu'):
        super(SimpleRNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.output_size = output_size

        self.rnn_cell_list = nn.ModuleList()

        if activation not in ('tanh', 'relu', 'sigmoid'):
            raise ValueError("Invalid activation. Please use tanh, relu or sigmoid activation.")

        # the first cell maps input -> hidden, stacked cells map hidden -> hidden
        self.rnn_cell_list.append(RnnCell(self.input_size,
                                          self.hidden_size,
                                          activation))
        for _ in range(1, self.num_layers):
            self.rnn_cell_list.append(RnnCell(self.hidden_size,
                                              self.hidden_size,
                                              activation))

        self.fc = nn.Linear(self.hidden_size, self.output_size)
        #self.sigmoid = nn.Sigmoid()

    def forward(self, input, hidden_state=None):
        '''
        Inputs: input (torch tensor) of shape [batchsize, sequence length, inputsize]
        Output: output (torch tensor) of shape [batchsize, outputsize]
        '''

        # initialise the hidden state with zeros at the first timestep
        if hidden_state is None:
            hidden_state = torch.zeros(self.num_layers, input.size(0), self.hidden_size, device=input.device)


        outs = []

        # per-layer hidden states, updated at every timestep
        hidden = [hidden_state[layer, :, :] for layer in range(self.num_layers)]
        for t in range(input.size(1)):
            for layer in range(self.num_layers):
                if layer == 0:
                    hidden_l = self.rnn_cell_list[layer](input[:, t, :], hidden[layer])
                else:
                    hidden_l = self.rnn_cell_list[layer](hidden[layer - 1], hidden[layer])
                hidden[layer] = hidden_l

            # keep the top layer's output for this timestep
            outs.append(hidden_l)

        # select the last time step; no squeeze, so a batch of size 1 keeps its batch dimension
        out = outs[-1]
        out = self.fc(out)
        return out

Training

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
weight_decay = 0
sequence_length = 28*28
input_size = 28
hidden_size = 6
nlayers = 1
nclasses = 10
batch_size = 64
nepochs = 50
T_max = nepochs - 5
lr = 0.00001
save_model = True
continue_training = False

data_dir =  'data/'

def train (train_loader, model, optimizer, loss_f):
    '''
    Input: train loader (torch loader), model (torch model), optimizer (torch optimizer),
          loss function (torch loss).
    Output: epoch loss (float), epoch accuracy (float).
    '''
    model.train()
    loss_lst = []
    #top1_acc_lst = []
    #top5_acc_lst = []
    correct = 0 
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        out = model(x_expanded)
        del x
        del x_expanded
        out = F.softmax(out, dim = 1)
        pred = torch.argmax(out, dim = 1)
        correct += sum(pred == y)
        loss_val = loss_f(out, y)
        loss_lst.append(float(loss_val.item()))
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
    # compute the average within each list to obtain final value for a single epoch
    loss_val = lst_avg(loss_lst)
    train_acc = round(correct.item() / len(train_loader.dataset), 5)
    return (loss_val, train_acc)

def test (test_loader, model, loss_f):
    '''
    Input: test loader (torch loader), model (torch model), loss function
          (torch loss).
    Output: test loss (float), test accuracy (float).
    '''
    test_loss_lst = []
    model.eval()
    correct = 0 
    with torch.no_grad():
        for batch_idx, (x, y) in enumerate(test_loader):
            x, y = x.to(device), y.to(device)
            x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
            out = model(x_expanded)
            del x
            batchsize = x_expanded.shape[0]
            del x_expanded
            out = F.softmax(out, dim = 1)

            pred = torch.argmax(out, dim = 1)
            correct += sum(pred == y)
            test_loss_val = loss_f(out, y)
            test_loss_lst.append(float(test_loss_val.item()))

        test_loss_val = lst_avg(test_loss_lst)
        test_acc = round(correct.item() / len(test_loader.dataset), 5)

        return (test_loss_val, test_acc)

def main():
    print(f'Simple RNN initalised with {nlayers} layers and {hidden_size} number of hidden neurons.')
    model = SimpleRNN(input_size = input_size*input_size, hidden_size = hidden_size, num_layers=nlayers, output_size = 10, activation = 'relu').to(device)
    optimizer = optim.Adam(model.parameters(), lr = lr, weight_decay = weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = 145, eta_min = 0)
    loss_f = nn.CrossEntropyLoss()
    
    train_loss_lst = []
    test_loss_lst = []
    train_top1acc_lst = []
    test_top1acc_lst = []

    last_epoch = 0

    train_dataset = torchvision.datasets.MNIST(root = data_dir,
                                           train=True, 
                                           transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]),
                                           download=True)

    test_dataset = torchvision.datasets.MNIST(root =  data_dir,
                                          train = False, 
                                          transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]))
   
    train_loader = DataLoader(dataset=train_dataset,
                                           batch_size = batch_size, 
                                           shuffle = True)
    
    test_loader = DataLoader(dataset=test_dataset,
                                          batch_size = batch_size, 
                                          shuffle = False)

    for epoch in range(nepochs - last_epoch):   
        train_loss_value, train_top1acc_value= train(train_loader, model, optimizer, loss_f)
        train_loss_lst.append(train_loss_value)
        train_top1acc_lst.append(train_top1acc_value)
        test_loss_value, test_top1acc_value = test(test_loader, model, loss_f)
        test_loss_lst.append(test_loss_value)
        test_top1acc_lst.append(test_top1acc_value)

        print(f"Epoch:{epoch + last_epoch + 1 }  Train[Loss:{train_loss_value}  Top1 Acc:{train_top1acc_value}]")
        print(f"Epoch:{epoch + last_epoch + 1 }  Test[Loss:{test_loss_value}    Top1 Acc:{test_top1acc_value}]")


if __name__ == "__main__":
    main()

All the best,
weight_theta

Maybe that is because you compute the training loss while you are still taking gradient descent steps. Each mini-batch loss is measured with the weights as they are at that point in the epoch, so the epoch average mixes relatively bad weights (first batches) with relatively good ones (last batches).

When you evaluate on the test dataset, you use the slightly better end-of-epoch model, and you have not yet managed to overfit.

You could try to finish training an epoch first and then measure your loss on the training dataset and see what happens.
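To illustrate the effect, here is a minimal sketch on a made-up one-parameter regression problem (plain Python, not the RNN from this thread; the data, learning rate and batch size are assumptions chosen only for the demonstration). It records the mini-batch loss during one epoch of gradient descent and then re-evaluates the same training data with the final weights.

```python
import random

rng = random.Random(0)

# Toy data (made up for illustration): y = 3 * x + small noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


def mse(w, samples):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


w, lr, batch_size = 0.0, 0.1, 10
running_losses = []

# One epoch of mini-batch gradient descent.
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    # the loss is recorded BEFORE the update, with the weights of this step
    running_losses.append(mse(w, batch))
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad

running_avg = sum(running_losses) / len(running_losses)
end_of_epoch = mse(w, data)  # same training data, final weights

print(f"running average during epoch: {running_avg:.4f}")
print(f"re-evaluated after epoch:     {end_of_epoch:.4f}")
```

The running average is dominated by the early, poorly fitted batches, so it comes out higher than the loss of the final weights on the very same data.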

Thank you for your reply. I have tried this, as it made sense: I take an optimizer step on every training iteration, and the test function is only called afterwards. By making predictions and computing the loss once more after the optimizer step, I was able to make the training loss lower than the test loss:

Epoch:17  Train[Loss:2.55455  Top1 Acc:0.45765]
Epoch:17  Test[Loss:2.60654    Top1 Acc:0.4594]
Epoch:18  Train[Loss:2.60213  Top1 Acc:0.4684]
Epoch:18  Test[Loss:2.65454    Top1 Acc:0.4704]
Epoch:19  Train[Loss:2.64696  Top1 Acc:0.47678]
Epoch:19  Test[Loss:2.69964    Top1 Acc:0.4796]

However, the test accuracy is still a little better, so something is still somewhat incorrect, is it not? Something is fishy. Could you have a look at my train and test functions and my training loop, shown below? I would seriously appreciate it.

train and test

def train (train_loader, model, optimizer, loss_f):
    '''
    Input: train loader (torch loader), model (torch model), optimizer (torch optimizer),
          loss function (torch loss).
    Output: epoch loss (float), epoch accuracy (float).
    '''
    model.train()
    loss_lst = []

    correct = 0 
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        out = model(x_expanded)
        class_prob = F.softmax(out, dim = 1)
        loss_val = loss_f(class_prob, y)
        del out, class_prob
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        del loss_val
        
        # eval train on new updated weights 
        out = model(x_expanded)
        class_prob = F.softmax(out, dim = 1)
        loss_val = loss_f(out, y)
        loss_lst.append(float(loss_val.item()))
        del x_expanded
        pred = torch.argmax(class_prob, dim = 1)
        correct += sum(pred == y)

    loss_val = lst_avg(loss_lst)
    train_acc = round(correct.item() / len(train_loader.dataset), 5)
    return (loss_val, train_acc)

def test (test_loader, model, loss_f):
    '''
    Input: test loader (torch loader), model (torch model), loss function
          (torch loss).
    Output: test loss (float), test accuracy (float).
    '''
    test_loss_lst = []
    model.eval()
    correct = 0 
    with torch.no_grad():
        for batch_idx, (x, y) in enumerate(test_loader):
            x, y = x.to(device), y.to(device)
            x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
            out = model(x_expanded)

            class_prob = F.softmax(out, dim = 1)
            pred = torch.argmax(class_prob, dim = 1)
            correct += sum(pred == y)
            test_loss_val = loss_f(out, y)
            test_loss_lst.append(float(test_loss_val.item()))

        test_loss_val = lst_avg(test_loss_lst)
        test_acc = round(correct.item() / len(test_loader.dataset), 5)
        return (test_loss_val, test_acc)

train loop

def main():
    print(f'Simple RNN initalised with {nlayers} layers and {hidden_size} number of hidden neurons.')
    model = SimpleRNN(input_size = input_size*input_size, hidden_size = hidden_size, num_layers=nlayers, output_size = 10, activation = 'relu').to(device)
    optimizer = optim.Adam(model.parameters(), lr = lr, weight_decay = weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = 145, eta_min = 0)
    loss_f = nn.CrossEntropyLoss()
    
    train_loss_lst = []
    test_loss_lst = []
    train_top1acc_lst = []
    test_top1acc_lst = []

    last_epoch = 0

    train_dataset = torchvision.datasets.MNIST(root = data_dir,
                                           train=True, 
                                           transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]),
                                           download=True)

    test_dataset = torchvision.datasets.MNIST(root =  data_dir,
                                          train = False, 
                                          transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]))
   
    train_loader = DataLoader(dataset=train_dataset,
                                           batch_size = batch_size, 
                                           shuffle = True)
    
    test_loader = DataLoader(dataset=test_dataset,
                                          batch_size = batch_size, 
                                          shuffle = False)

    for epoch in range(nepochs - last_epoch):
        
        train_loss_value, train_top1acc_value= train(train_loader, model, optimizer, loss_f)
        train_loss_lst.append(train_loss_value)
        train_top1acc_lst.append(train_top1acc_value)
     
        test_loss_value, test_top1acc_value = test(test_loader, model, loss_f)
        test_loss_lst.append(test_loss_value)
        test_top1acc_lst.append(test_top1acc_value)

        print(f"Epoch:{epoch + last_epoch + 1 }  Train[Loss:{train_loss_value}  Top1 Acc:{train_top1acc_value}]")
        print(f"Epoch:{epoch + last_epoch + 1 }  Test[Loss:{test_loss_value}    Top1 Acc:{test_top1acc_value}]")

        
if __name__ == "__main__":
    main()

Hey @weight_theta, I can't run your full code because I don't know how lst_avg is implemented. But I think I see one potential reason why you might see differences. I changed the lower part of your train function: in this new implementation the training epoch is completed before any metrics are calculated.

def train (train_loader, model, optimizer, loss_f):
    '''
    Input: train loader (torch loader), model (torch model), optimizer (torch optimizer),
          loss function (torch loss).
    Output: train loss (float), train accuracy (float), both computed after the epoch.
    '''
    model.train()
    loss_lst = []

    correct = 0 
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        out = model(x_expanded)
        class_prob = F.softmax(out, dim = 1)
        loss_val = loss_f(class_prob, y)
        del out, class_prob
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        del loss_val
        
    # eval train on new updated weights 
    (train_loss, train_acc) = test(train_loader, model, loss_f)
    return train_loss, train_acc

Hi, thank you for your reply. lst_avg is the average of a list, i.e. sum(lst) / len(lst).
I have thoroughly played around with your suggestion and made some changes to the code. I now ensure that evaluation runs only once training for the epoch has ended, for both the train and test set (see the training loop below).
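For reference, the lst_avg helper described above amounts to the following; the example values are made up.

```python
def lst_avg(lst):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(lst) / len(lst)


print(lst_avg([2.0, 2.5, 3.0]))
```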

However, once I increase the number of neurons beyond 4 (e.g., to 16) and start stacking two RNNs on top of each other, test accuracy becomes higher than train accuracy after a few epochs, and the test loss also turns out lower than the train loss. It's kind of suspicious.
Is there any way to explain this behavior? In theory it shouldn't happen.

Epoch, loss, accuracy

Simple RNN initalised with 1 layers and 16 number of hidden neurons.
Epoch:1  Train[Loss:2.2168  Top1 Acc:0.4582]
Epoch:1  Test[Loss:2.2172    Top1 Acc:0.4504]
Epoch:2  Train[Loss:2.0819  Top1 Acc:0.5346]
Epoch:2  Test[Loss:2.0823    Top1 Acc:0.5315]
....
Epoch:7  Train[Loss:1.7748  Top1 Acc:0.6956]
Epoch:7  Test[Loss:1.7714    Top1 Acc:0.6978]
Epoch:8  Train[Loss:1.7642  Top1 Acc:0.6999]
Epoch:8  Test[Loss:1.7602    Top1 Acc:0.7049]

train, eval and main loop

def train (train_loader, model, optimizer, loss_f):
    '''
    Performs the training loop. 
    Input: train loader (torch loader)
           model (torch model)
           optimizer (torch optimizer)
           loss function (torch loss).
    Output: No output.
    '''
    model.train()
    correct = 0 
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        out = model(x_expanded)
        del x
        class_prob = F.softmax(out, dim = 1)
        loss_val = loss_f(class_prob, y)
        del out, class_prob
        # note: with zero_grad() commented out, gradients accumulate across batches
        #optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
        del loss_val    
    return None 

def evaluate (data_loader, model, loss_f):
    '''
    Input: test or train loader (torch loader) 
           model (torch model)
           loss function (torch loss)
    Output: loss (torch float)
            accuracy (torch float)
    '''
    loss_lst = []
    model.eval()
    correct = 0 
    with torch.no_grad():
        for batch_idx, (x, y) in enumerate(data_loader):
            x, y = x.to(device), y.to(device)
            x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
            out = model(x_expanded)
            del x
            del x_expanded
            class_prob = F.softmax(out, dim = 1)
            pred = torch.argmax(class_prob, dim = 1)
            correct += sum(pred == y)
            loss_val = loss_f(class_prob, y)
            loss_lst.append(float(loss_val.item()))
            del y, out
        # compute average loss
        loss_val = round(sum(loss_lst) / len(loss_lst), 4)
        acc = round(correct.item() / len(data_loader.dataset), 4)
        return (loss_val, acc)

def main():
    print(f'Simple RNN initalised with {nlayers} layers and {hidden_size} number of hidden neurons.')
    model = SimpleRNN(input_size = input_size*input_size, hidden_size = hidden_size, num_layers=nlayers, output_size = 10, activation = 'relu').to(device)
    optimizer = optim.Adam(model.parameters(), lr = lr, weight_decay = weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = 145, eta_min = 0)
    loss_f = nn.CrossEntropyLoss()
    
    train_loss_lst = []
    test_loss_lst = []
    train_top1acc_lst = []
    test_top1acc_lst = []
    last_epoch = 0
    
    train_dataset = torchvision.datasets.MNIST(root = data_dir,
                                           train=True, 
                                           transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]),
                                           download=True)

    test_dataset = torchvision.datasets.MNIST(root =  data_dir,
                                          train = False, 
                                          transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]))
   
    # we drop the last batch since it is smaller than the original batch size
    # and accuracy is affected by varying batch size
    train_loader = DataLoader(dataset = train_dataset,
                                           batch_size = batch_size, 
                                           shuffle = True, drop_last = True)
    
    test_loader = DataLoader(dataset = test_dataset,
                                          batch_size = batch_size, 
                                          shuffle = False, drop_last = True)

    for epoch in range(nepochs - last_epoch):
        
        # train 
        train(train_loader, model, optimizer, loss_f)
        train_loss_value, train_top1acc_value = evaluate(train_loader, model, loss_f)
        train_loss_lst.append(train_loss_value)
        train_top1acc_lst.append(train_top1acc_value)
        
        # test 
        test_loss_value, test_top1acc_value = evaluate(test_loader, model, loss_f)
        test_loss_lst.append(test_loss_value)
        test_top1acc_lst.append(test_top1acc_value)

        print(f"Epoch:{epoch + last_epoch + 1 }  Train[Loss:{train_loss_value}  Top1 Acc:{train_top1acc_value}]")
        print(f"Epoch:{epoch + last_epoch + 1 }  Test[Loss:{test_loss_value}    Top1 Acc:{test_top1acc_value}]")

if __name__ == "__main__":
    main()

I assume you expect to see some sort of overfitting here: test loss higher than training loss, and test accuracy lower than training accuracy. But it looks like you are not even close to overfitting. For MNIST it is relatively easy to reach a test accuracy of over 95% with a plain fully connected neural network. My suggestion is to simply train longer until your train and test losses start to diverge. So far both are still decreasing, so there is no overfitting.

You can also look at the PyTorch tutorial here. They also use a recurrent neural network to deal with MNIST, although it is a different approach.
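One more thing that may be worth checking, offered as a hypothesis rather than a confirmed diagnosis: nn.CrossEntropyLoss applies log-softmax internally and expects raw logits, while several of the loops in this thread pass it F.softmax(out, dim = 1). The pure-Python sketch below uses illustrative re-implementations (`softmax` and `cross_entropy` are not the PyTorch functions, and the logits are made up). It shows that in this situation the loss for 10 classes cannot fall below log(1 + 9/e), roughly 1.46, even for a fully confident correct prediction.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross-entropy that treats `scores` as logits: -log(softmax(scores)[target])."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


# Made-up logits for a 10-class problem: very confident and correct (class 0).
logits = [20.0] + [0.0] * 9
target = 0

loss_on_logits = cross_entropy(logits, target)
# feeding probabilities means softmax is effectively applied twice
loss_on_probs = cross_entropy(softmax(logits), target)
floor = math.log(1.0 + 9.0 / math.e)

print(f"loss on logits:        {loss_on_logits:.4f}")
print(f"loss on probabilities: {loss_on_probs:.4f}")
print(f"lower bound:           {floor:.4f}")
```

If this applies here, passing `out` directly to the loss and keeping softmax only for reporting probabilities would remove the floor; the argmax predictions are unaffected because softmax preserves the ordering of the scores. Losses levelling off near 1.76 would be consistent with such a floor, but that has not been verified against the code in this thread.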