Do I need to average loss and accuracy for sequential MNIST classification?

Dear PyTorch Community,

I am currently working on a small sanity check for my RNN using sequential MNIST classification, and I was wondering whether I need to collect the loss and other metrics (top-1 accuracy, top-5 accuracy) in a list over the iterations and then compute the average of that list?

This is currently done in def train(train_loader, model, optimizer, loss_f):. The train function is then called in def main():, which contains the training loop over the epochs. Please correct my understanding: the train function performs its operations over each iteration within a single epoch. If that is the case, then I should collect the loss and metrics in a list and average them once the iterations are over, i.e. once the epoch has ended, correct?

def train(train_loader, model, optimizer, loss_f):
    '''
    Input: train_loader (torch DataLoader), model (torch model), optimizer (torch optimizer),
           loss_f (torch loss function, here nn.CrossEntropyLoss).
    Output: tuple of epoch-averaged loss, top-1 accuracy and top-5 accuracy (floats).
    '''
    model.train()
    loss_lst = []
    top1_acc_lst = []
    top5_acc_lst = []
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        #x_expanded = x.reshape(-1, sequence_length, input_size)
        out = model(x_expanded)
        del x
        del x_expanded
        out = F.softmax(out, dim = 1)
        # store top1 accuracy, top5 accuracy and loss per iteration in list 
        top1_acc_lst.append(top1accuracy(out, y, batch_size))
        top5_acc_lst.append(top5accuracy(out, y, batch_size))
        loss_val = loss_f(out, y)
        loss_lst.append(float(loss_val.item()))
        del y
        del out
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
    # compute the average within each list to obtain final value for a single epoch
    top1_acc = lst_avg(top1_acc_lst)
    top5_acc = lst_avg(top5_acc_lst)
    loss_val = lst_avg(loss_lst)
    return (loss_val, top1_acc, top5_acc)

def main():
    print(f'Simple RNN initialised with {nlayers} layers and {hidden_size} hidden neurons.')
    model = SimpleRNN(input_size = input_size*input_size, hidden_size = hidden_size, num_layers=nlayers, output_size = 10, activation = 'relu').to(device)
    optimizer = optim.Adam(model.parameters(), lr = lr, weight_decay = weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max = 145, eta_min = 0)
    loss_f = nn.CrossEntropyLoss()
    
    train_loss_lst = []
    test_loss_lst = []
    train_top1acc_lst = []
    test_top1acc_lst = []
    train_top5acc_lst = []
    test_top5acc_lst = []
    last_epoch = 0
    
    train_dataset = torchvision.datasets.MNIST(root = data_dir,
                                           train=True, 
                                           transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]),
                                           download=True)

    test_dataset = torchvision.datasets.MNIST(root =  data_dir,
                                          train = False, 
                                          transform=T.Compose([T.ToTensor(), T.Lambda(torch.flatten)]))
   
    train_loader = DataLoader(dataset=train_dataset,
                                           batch_size = batch_size, 
                                           shuffle = True)
    
    test_loader = DataLoader(dataset=test_dataset,
                                          batch_size = batch_size, 
                                          shuffle = False)

    for epoch in range(nepochs - last_epoch):
        
        # 1. linear increase from 0.00001 to 0.0001 over 5 epochs
        if epoch + last_epoch > 0 and epoch + last_epoch <= 5:
            optimizer.param_groups[0]['lr'] =  0.00001 +(0.00009/5) * (epoch + last_epoch)
        # 2. decrease from 0.0001 to 0 using cosine annealing 
        elif epoch + last_epoch > 5:
            scheduler.step()
        
        train_loss_value, train_top1acc_value, train_top5acc_value = train(train_loader, model, optimizer, loss_f)
        train_loss_lst.append(train_loss_value)
        train_top1acc_lst.append(train_top1acc_value)
        train_top5acc_lst.append(train_top5acc_value)
        
        test_loss_value, test_top1acc_value, test_top5acc_value  = test(test_loader, model, loss_f)
        test_loss_lst.append(test_loss_value)
        test_top1acc_lst.append(test_top1acc_value)
        test_top5acc_lst.append(test_top5acc_value)

        print(f"Epoch:{epoch + last_epoch + 1 }  Train[Loss:{train_loss_value}  Top5 Acc:{train_top5acc_value}  Top1 Acc:{train_top1acc_value}]")
        print(f"Epoch:{epoch + last_epoch + 1 }  Test[Loss:{test_loss_value}  Top5 Acc:{test_top5acc_value}  Top1 Acc:{test_top1acc_value}]")

However, there are a few things that strike me as a little odd. For one, test accuracy always seems to be slightly better than train accuracy, even though weight_decay = 0.0005 only enforces a very small regularisation. In theory and in practice regularisation could explain a small edge of test over train, but since the regularisation is so small I suspect I am doing something incorrectly. Furthermore, if I do not compute the metrics by averaging, my accuracy is capped at about 0.5. I suspect this happens because I was initially only retaining the metric of the last iteration within an epoch.

However, when averaging I now reach quite acceptable performance metrics during training, well over 0.5, as can be seen below:

Simple RNN initialised with 2 layers and 64 hidden neurons.
Epoch:1  Train[Loss:2.282  Top5 Acc:0.6882  Top1 Acc:0.1736]
Epoch:1  Test[Loss:2.1989  Top5 Acc:0.8553  Top1 Acc:0.2897]
Epoch:2  Train[Loss:1.8816  Top5 Acc:0.9194  Top1 Acc:0.6155]
Epoch:2  Test[Loss:1.7117  Top5 Acc:0.9668  Top1 Acc:0.7689]
Epoch:3  Train[Loss:1.6802  Top5 Acc:0.9703  Top1 Acc:0.7997]
Epoch:3  Test[Loss:1.6411  Top5 Acc:0.9752  Top1 Acc:0.8336]
Epoch:4  Train[Loss:1.6395  Top5 Acc:0.9717  Top1 Acc:0.8345]
Epoch:4  Test[Loss:1.6024  Top5 Acc:0.9749  Top1 Acc:0.8623]
Epoch:5  Train[Loss:1.6008  Top5 Acc:0.9779  Top1 Acc:0.867]
Epoch:5  Test[Loss:1.5862  Top5 Acc:0.9718  Top1 Acc:0.8763]
Epoch:6  Train[Loss:1.6007  Top5 Acc:0.9717  Top1 Acc:0.865]
Epoch:6  Test[Loss:1.5916  Top5 Acc:0.9775  Top1 Acc:0.8713]

Please let me know what you think and whether my understanding is correct. I would be happy to learn.

Kind regards,
weight_thetas

You would have to check whether your last batch contains fewer samples and decide whether you want to drop it via drop_last=True in the DataLoader, since calculating the average of the per-batch accuracies might add a small bias, as seen here:

# average using different batch sizes
tmp = 5/10 + 6/10 + 10/10 + 1/1
tmp / 4
# 0.775

# accuracy using number of samples
tmp = 5 + 6 + 10 + 1
tmp / 31
# 0.7096774193548387

In this example I’m using 31 samples with a batch size of 10, so the last batch contains only a single sample. While the actual accuracy would be 22/31 ≈ 0.71, the average of the batch accuracies adds a bias.
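
If you decide to drop the incomplete last batch, it is just an additional flag on the loader; something like this, reusing the names from your script:

# drop the final, smaller batch so that every batch holds exactly batch_size samples;
# the plain average of the per-batch accuracies is then unbiased
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True,
                          drop_last=True)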

Hi Patrick,

Thank you for your reply. I had a similar thought, since in the last iteration exactly what you describe did happen. To overcome this, I made the batchsize variable depend on the batch dimension of x_expanded (its first dimension) instead of on a globally defined variable, as follows:

batchsize = x_expanded.shape[0]
out = F.softmax(out, dim = 1)
top1_acc_lst.append(top1accuracy(out, y, batchsize))
top5_acc_lst.append(top5accuracy(out, y, batchsize))
1. However, as you describe, the bias from using a smaller batch size in the last iteration would then still be incurred, correct?
2. What if I instead kept track of the total number of correct predictions and divided that total by len(train_loader.dataset), as shown in the code below? Would that overcome the bias? Computing the accuracy would then depend on all the data used within an epoch instead of on a single iteration and its batch. What do you think?
    total = 0
    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)
        # turn [64, 784] to [64, 784, 784]
        x_expanded = x[:, None, ...].expand(x.shape[0], x.shape[1], x.shape[1]).to(device)
        out = model(x_expanded)
        out = F.softmax(out, dim = 1)
        pred = torch.argmax(out, dim = 1)
        # accumulate the number of correct predictions over the whole epoch
        total += (pred == y).sum().item()
        loss_val = loss_f(out, y)
        loss_lst.append(float(loss_val.item()))
        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()
    # divide by the total number of samples to obtain the epoch accuracy
    acc = total / len(train_loader.dataset)
    return (loss_val, acc)

All the best,
weight_theta

Yes, this should be correct and would fit the second part of the code I’ve posted.
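
As a quick sanity check, weighting each per-batch accuracy by its batch size gives the same number as counting the correct predictions over the whole dataset; with the toy numbers from my earlier example:

# per-batch correct counts and batch sizes from the 31-sample example above
correct = [5, 6, 10, 1]
sizes = [10, 10, 10, 1]

# weight each batch accuracy by its batch size, then normalise by the total sample count
weighted = sum((c / n) * n for c, n in zip(correct, sizes)) / sum(sizes)
# 22 / 31 = 0.7096774193548387, i.e. the accuracy over all samples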
