Unexpectedly high results on EMNIST-Letters with no augmentation and simple CNN

After training a simple CNN on EMNIST-Letters, I always end up with a classification accuracy close to 93.7%, which feels a bit high. Are these results unrealistic? Is there anything I am doing wrong in the implementation below?

Here’s the model used in the experiments:

  model = nn.Sequential(


  return model

Here’s the train function:

import copy
import time

import torch.nn as nn
import torch.optim as optim

def train(numb_epoch=3, lr=1e-3, device="cpu"):
  cnn = cnn_model().to(device)
  cec = nn.CrossEntropyLoss()
  optimizer = optim.Adam(cnn.parameters(), lr=lr)
  max_accuracy = 0
  total_time = 0
  best_model = None

  for epoch in range(numb_epoch):
    t0 = time.time()
    cnn.train()  # make sure the model is in training mode
    for images, labels in train_loader:
      images = images.to(device)
      labels = labels.to(device)
      optimizer.zero_grad()  # clear gradients from the previous batch
      pred = cnn(images)
      loss = cec(pred, labels)
      loss.backward()        # backpropagate the loss
      optimizer.step()       # update the weights
    # Note: the best model is selected using test_loader here
    accuracy = float(validate(cnn, test_loader))
    if accuracy > max_accuracy:
      best_model = copy.deepcopy(cnn)
      max_accuracy = accuracy
      print("Saving Best Model with Accuracy: ", accuracy)
    total_time += (time.time() - t0)
    print("Elapsed time after Epoch ", epoch+1, ' {} seconds'.format(total_time))
    print("Epoch: ", epoch+1, " Accuracy: ", accuracy, "%")
  averageTime = total_time/numb_epoch
  print("Average epoch time = ", '{} seconds'.format(averageTime))
  return best_model

Here’s the validation function:

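import torch

def validate(model, data):
  total = 0
  correct = 0
  device = next(model.parameters()).device  # evaluate on the model's own device
  was_training = model.training
  model.eval()  # inference behaviour for layers such as dropout / batch norm

  with torch.no_grad():  # gradients are not needed for evaluation
    for images, labels in data:
      images = images.to(device)
      x = model(images)
      _, pred = torch.max(x, 1)
      total += labels.size(0)
      correct += (pred.cpu() == labels).sum().item()
  model.train(was_training)  # restore the mode the model was in
  return correct * 100. / total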

I don’t know what the “standard” data split for EMNIST-Letters would be, but it seems you are checking the highest validation accuracy, which might be invalid. In a common setup, you would split the data into a training, validation, and test dataset. The training and validation splits are used to train and validate the model, respectively, while the unseen test set should be used once after the training is finished (e.g. after you’ve used early stopping using the validation accuracy).
If you are using a two-way split, one could argue that the validation data was already “seen” or “used” in the training process, since you are using the validation accuracy to store the best model.
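
A minimal, framework-agnostic sketch of that three-way protocol, assuming training and evaluation are wrapped in callables (`train_one_epoch`, `eval_val` and `eval_test` are hypothetical names, not part of the code posted above). The best epoch is chosen from the validation accuracy only, and the test split is scored exactly once at the end:

```python
import copy


def select_on_val_then_test(model, num_epochs, train_one_epoch, eval_val, eval_test):
    """Pick the best checkpoint by validation accuracy, then test it once.

    All three callables are hypothetical placeholders:
      train_one_epoch(model) -> None    trains the model in place for one epoch
      eval_val(model)  -> float         accuracy on the validation split
      eval_test(model) -> float         accuracy on the held-out test split
    """
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")

    best_val = float("-inf")
    best_model = None
    for _ in range(num_epochs):
        train_one_epoch(model)
        val_acc = eval_val(model)
        if val_acc > best_val:
            # Model selection looks at the validation split only.
            best_val = val_acc
            best_model = copy.deepcopy(model)

    # The test split is touched a single time, after selection is finished.
    test_acc = eval_test(best_model)
    return best_model, best_val, test_acc
```

With the functions from the question, `eval_val` could be something like `lambda m: float(validate(m, val_loader))`, where `val_loader` is an assumed loader over a separate validation split.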

In any case, I don’t know if there are predefined training, validation, test splits or how e.g. this benchmark on paperswithcode was created.

You’re right @ptrblck , I was checking the highest validation accuracy!

EMNIST-Letters is by default split into a training set (124800 samples) and test set (20800).
Now I decided to split the training data into training set (104000 samples) and validation set (20800).
In other words, the 145600 samples are split as follows: training set (about 71.4%), validation set (about 14.3%) and test set (about 14.3%), i.e. roughly 70/15/15.
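
As a sanity check on those numbers, here is a small sketch in plain Python that carves 20800 validation indices out of the 124800 training samples and prints the resulting shares. The helper name `split_indices` and the seed are assumptions made for the example:

```python
import random


def split_indices(num_samples, num_val, seed=0):
    """Shuffle range(num_samples) and set aside num_val indices for validation."""
    if not 0 < num_val < num_samples:
        raise ValueError("num_val must be between 1 and num_samples - 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(124_800, 20_800)
num_test = 20_800  # size of the predefined EMNIST-Letters test set
total = len(train_idx) + len(val_idx) + num_test

for name, n in [("train", len(train_idx)), ("val", len(val_idx)), ("test", num_test)]:
    print(f"{name}: {n} samples ({100 * n / total:.1f}%)")
# train: 104000 samples (71.4%)
# val: 20800 samples (14.3%)
# test: 20800 samples (14.3%)
```

The two index lists can be wrapped with `torch.utils.data.Subset` to get the actual datasets; `torch.utils.data.random_split` is a one-call alternative that does the shuffling itself.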

However, when training the model and then measuring its accuracy on the test set, I see that the test accuracy is significantly higher than the training accuracy, which feels a bit weird. Sometimes the test and training accuracy differ by 10 percentage points.

Are there any reasons for this output? I was thinking that it might have to do with the training set being significantly bigger than the test set…

EDIT: From reading other forums, it seems the problem can arise because I am augmenting my training set and validation set, which makes it harder for the model to make predictions on them, i.e. it lowers the training accuracy. The data in the test set might be easier, which results in a higher test accuracy.
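
One way to check this explanation is to score the same training samples twice with the same trained model, once with and once without augmentation. The sketch below is a toy illustration in plain Python: the threshold classifier, the four samples and the `shift` "augmentation" are all made up for the example and are not EMNIST data.

```python
def accuracy_percent(predict, samples):
    """Percentage of (input, label) pairs for which predict(input) == label."""
    correct = sum(1 for x, y in samples if predict(x) == y)
    return 100.0 * correct / len(samples)


def predict(x):
    # Hypothetical stand-in for a trained model: class 1 if x >= 0.5, else class 0.
    return 1 if x >= 0.5 else 0


def shift(x):
    # Hypothetical stand-in for a training-time augmentation.
    return x + 0.3


clean = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
augmented = [(shift(x), y) for x, y in clean]

print(accuracy_percent(predict, clean))      # 100.0
print(accuracy_percent(predict, augmented))  # 75.0
```

In the PyTorch setup above, the analogous check would be to call `validate` on a second loader over the training samples that uses the test-time transform (no augmentation). If that accuracy lands close to the test accuracy, the gap likely comes from the augmentation rather than from the relative sizes of the splits.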