Higher accuracy in the validation dataset

Hi all,

I am training resnet18 for image classification and I am getting good results, but the accuracy on the validation dataset is higher than on the training dataset.

| train loss: 0.0059 | train acc: 0.9621 | val loss: 0.007485 | val acc: 0.9771 |

For the training dataset I used weighted cross entropy as the loss function, since my dataset is imbalanced.

Any help is appreciated.

Hi Azeai!

There are a number of possible reasons for this:

(First, a quick note: One would normally expect your model to perform
at least a little better on the training data.)

The character of your training and validation datasets could be
different. For example, maybe your validation images were collected
under better conditions and are less noisy. Then they might be easier
to classify so you would get a higher accuracy. Or maybe certain
classes are easier to classify than others and your validation dataset
has more samples of the easier classes in it. There’s nothing
fundamentally wrong with using distinctly different datasets for training
and validation, but the accuracies won’t be directly comparable.

It’s often the case that you have one large dataset that you randomly
split into training and validation (and possibly test) datasets. Now
your training and validation datasets will have the same character.
However, by happenstance, it could be that your validation dataset
nonetheless has more of the easier samples in it, so you get a higher
validation accuracy. You could test this by retraining your model
and performing the training / validation comparison multiple times
with different random splits for the training and validation datasets.
Even if for some specific runs you have (by happenstance) a higher
validation accuracy, you should find, on average, that your validation
accuracy is lower than your training accuracy.
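As an illustration, here is a minimal sketch of such an experiment. `full_dataset` and
`train_and_evaluate()` are hypothetical placeholders for your own dataset and training loop:

```python
# Hypothetical sketch: repeat the train / val comparison over several random splits.
# `full_dataset` and `train_and_evaluate()` stand in for your own dataset and training loop.
import torch
from torch.utils.data import random_split

n_val = int(0.2 * len(full_dataset))
n_train = len(full_dataset) - n_val

for seed in range(5):
    gen = torch.Generator().manual_seed(seed)
    train_ds, val_ds = random_split(full_dataset, [n_train, n_val], generator=gen)
    train_acc, val_acc = train_and_evaluate(train_ds, val_ds)
    print(f"seed {seed}: train acc {train_acc:.4f} | val acc {val_acc:.4f}")
```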

If you use your validation dataset to select your final model, you
could bias your model to perform better on the validation dataset.
To be concrete, let’s say that your training protocol is to train for 100
epochs, and you keep as your final model the model from each of
the last 25 training epochs that has the highest validation-dataset
accuracy (or perhaps the lowest validation-dataset loss). This is not
an unreasonable thing to do – the quality of a model can jump around
during training, and you’re trying to pick the best one – but the chosen
model will be biased to perform especially well on the validation dataset,
even if it doesn’t perform as well (and it likely won’t) on an independent
test dataset.
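
Such a protocol might look roughly like the sketch below, where `train_one_epoch()`
and `validate()` are hypothetical helpers standing in for your own code:

```python
# Hypothetical sketch of "keep the best-validation model from the last 25 of 100 epochs."
# `train_one_epoch()` and `validate()` stand in for your own training / evaluation code.
import copy

best_val_acc = 0.0
best_state = None

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)
    val_acc = validate(model, val_loader)
    # only consider the last 25 epochs, keeping the highest validation accuracy
    if epoch >= 75 and val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)  # this choice is biased toward the validation set
```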

If your model has separate “training” and “evaluation” modes (for
example, because it uses things like batch normalization or dropout),
your training-mode predictions and evaluation-mode predictions will
be different (even for the same input sample). If you want your training
and validation accuracies (and losses) to be directly comparable, you
can’t just collect your training accuracy during your training iterations;
you have to switch from training mode to evaluation mode when you
compute the training accuracy that you want to compare with your
validation accuracy.
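
A minimal sketch of how you might do that, assuming `model`, `train_loader`,
`val_loader`, and `device` already exist in your code:

```python
# Sketch: compute accuracy in eval mode so batch norm / dropout behave the same
# way for the training-set and validation-set accuracy calculations.
import torch

def accuracy(model, loader, device):
    model.eval()                      # use eval-mode batch norm / dropout
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    model.train()                     # switch back to training mode afterwards
    return correct / total

train_acc = accuracy(model, train_loader, device)  # now directly comparable to
val_acc = accuracy(model, val_loader, device)      # the validation accuracy
```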

Of course, you could have a bug somewhere that throws off one of
the accuracy calculations. There is no magic way to defend against
this – you just have to double-check your code and do some spot
checks of various results.

Using weighted cross entropy for training will not affect whether your training
and validation accuracies are comparable – your loss function doesn’t (directly) affect your
accuracy calculation. However, if you want your training and
validation losses to be comparable, you should use the same
loss function – including any class weights – for both. (If your
training and validation datasets have the same character, they
will, in particular, have the same class imbalance, and it would
not be inappropriate to use class weights for your validation loss
calculation.)
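
For example (a sketch, assuming `class_weights` is the same weight tensor you
already pass to your training loss):

```python
# Sketch: use the same weighted criterion for both the training and validation loss
# so that the two loss values are directly comparable.
import torch

criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# training step
loss = criterion(model(train_images), train_labels)

# validation step: same criterion, including the same class weights
model.eval()
with torch.no_grad():
    val_loss = criterion(model(val_images), val_labels)
model.train()
```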

Best.

K. Frank


This is also the case in the transfer learning tutorial. Could this be because we are augmenting our data? In the training set there are pictures that are randomly cropped and so on, so it is harder to guess what they actually are.
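
(For context, that tutorial applies random augmentation only to the training images,
while the validation images only get a deterministic resize and center crop, so the
training images are effectively harder to classify. The transforms there look roughly
like this sketch:)

```python
# Roughly the kind of transforms used: random augmentation for training,
# a deterministic resize / center crop for validation.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```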