Gap between training and validation loss

I created a model whose main purpose is to classify images based on the emotions of the people in them. I used a ResNet architecture and trained the model with the SGD optimizer. For the initial 20 epochs of training I followed a cyclic learning rate schedule; after that I decreased the learning rate by a factor of 0.1 every 10 epochs. I trained the model for 50 epochs in total.
While training I observed that during the initial 20 epochs the model converges very fast, and after decreasing the learning rate by a factor of 0.1 it still converges properly, at least until the 30th epoch. After that the training loss keeps decreasing steadily, and the validation loss also decreases, but with small oscillations in between. I also observed that after the 30th epoch the gap between the training and validation loss increases.
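For reference, the two-phase schedule described above could be wired up roughly like this in PyTorch. This is a sketch only; the stand-in model, the LR bounds, and the per-epoch batch count are made-up placeholders, not the poster's actual settings:

```python
import torch
from torch import nn, optim

model = nn.Linear(16, 7)  # stand-in for the ResNet emotion classifier
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Phase 1 (epochs 1-20): cyclic learning rate, stepped once per batch.
cyclic = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-3, max_lr=1e-1, step_size_up=100)
for epoch in range(20):
    for batch in range(100):  # dummy inner loop; real forward/backward goes here
        cyclic.step()

# Phase 2 (epochs 21-50): multiply the learning rate by 0.1 every 10 epochs.
step = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
for epoch in range(20, 50):
    # real training loop over batches goes here
    step.step()

print(optimizer.param_groups[0]["lr"])  # final LR after the full schedule
```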

I don't understand why this is happening, because both losses are converging, but with a huge difference between them. I tried tuning all the hyperparameters, especially the learning rate, but the issue remains.
Can someone please tell me where I am going wrong? A better solution for this issue is also welcome.

Below is the loss I observed while training:

Epoch: 20/50..  Training Loss: 1.402..  Validation Loss: 1.432..  Training Accuracy:  48.128..  Validation Accuracy: 47.815
Epoch: 21/50..  Training Loss: 1.359..  Validation Loss: 1.398..  Training Accuracy:  49.881..  Validation Accuracy: 49.033
Epoch: 22/50..  Training Loss: 1.331..  Validation Loss: 1.387..  Training Accuracy:  50.989..  Validation Accuracy: 49.582
Epoch: 23/50..  Training Loss: 1.305..  Validation Loss: 1.360..  Training Accuracy:  52.486..  Validation Accuracy: 50.515
Epoch: 24/50..  Training Loss: 1.273..  Validation Loss: 1.354..  Training Accuracy:  53.513..  Validation Accuracy: 50.735
Epoch: 25/50..  Training Loss: 1.244..  Validation Loss: 1.358..  Training Accuracy:  54.662..  Validation Accuracy: 51.000
Epoch: 26/50..  Training Loss: 1.209..  Validation Loss: 1.346..  Training Accuracy:  56.040..  Validation Accuracy: 51.047
Epoch: 27/50..  Training Loss: 1.184..  Validation Loss: 1.297..  Training Accuracy:  57.012..  Validation Accuracy: 53.435
Epoch: 28/50..  Training Loss: 1.154..  Validation Loss: 1.321..  Training Accuracy:  58.051..  Validation Accuracy: 51.720
Epoch: 29/50..  Training Loss: 1.114..  Validation Loss: 1.330..  Training Accuracy:  59.681..  Validation Accuracy: 52.743
Epoch: 30/50..  Training Loss: 1.084..  Validation Loss: 1.291..  Training Accuracy:  61.053..  Validation Accuracy: 52.894
Epoch: 31/50..  Training Loss: 1.052..  Validation Loss: 1.273..  Training Accuracy:  62.034..  Validation Accuracy: 54.937
Epoch: 32/50..  Training Loss: 1.018..  Validation Loss: 1.362..  Training Accuracy:  63.522..  Validation Accuracy: 52.836
Epoch: 33/50..  Training Loss: 0.980..  Validation Loss: 1.217..  Training Accuracy:  64.871..  Validation Accuracy: 57.392
Epoch: 34/50..  Training Loss: 0.940..  Validation Loss: 1.290..  Training Accuracy:  66.247..  Validation Accuracy: 54.176
Epoch: 35/50..  Training Loss: 0.904..  Validation Loss: 1.237..  Training Accuracy:  67.798..  Validation Accuracy: 57.181
Epoch: 36/50..  Training Loss: 0.858..  Validation Loss: 1.265..  Training Accuracy:  69.469..  Validation Accuracy: 56.388
Epoch: 37/50..  Training Loss: 0.824..  Validation Loss: 1.266..  Training Accuracy:  70.768..  Validation Accuracy: 56.755
Epoch: 39/50..  Training Loss: 0.743..  Validation Loss: 1.213..  Training Accuracy:  73.805..  Validation Accuracy: 58.673

From the above logs we can see that at the 39th epoch the training loss is 0.743 but the validation loss is much higher, and the validation accuracy is correspondingly low.

Hi @ashi,
This behaviour of ML models is expected. You can notice that even though there is a gap between the validation loss and the training loss, the general trend in both is downward. The validation loss curve will also be noisier than the training loss curve.

Possible steps to tackle the problem:

  1. Train for more epochs. This gives the model more time to search for global minima before settling into a local minimum.
  2. Increase model capacity. Over-parameterization, as discussed here, may help in achieving a lower validation loss.

Hope this helps!
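On point 2, "capacity" here roughly means parameter count. A toy comparison of two classifier heads (the layer sizes, the 48x48-input assumption, and the `n_params` helper are all illustrative, not from the poster's model):

```python
import torch.nn as nn

def n_params(model):
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical heads for flattened 48x48 grayscale faces, 7 emotion classes.
narrow = nn.Sequential(nn.Flatten(), nn.Linear(2304, 128),
                       nn.ReLU(), nn.Linear(128, 7))
wide = nn.Sequential(nn.Flatten(), nn.Linear(2304, 512),
                     nn.ReLU(), nn.Linear(512, 7))

print(n_params(narrow), n_params(wide))  # the wider head has ~4x the parameters
```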

I tried training for 70 epochs and the model is able to reach almost 90% training accuracy, but validation accuracy only gets to about 70%.
But yes, I can try increasing the model capacity, hoping that it will help.
