Gap between training and validation loss

I created a model whose main purpose is to classify images based on the emotions of the people in them. I used a ResNet architecture and trained the model with the SGD optimizer. For the first 20 epochs I followed a cyclic learning rate schedule; after that I decreased the learning rate by a factor of 0.1 every 10 epochs. I trained the model for 50 epochs in total.
While training I observed that during the first 20 epochs the model converges very fast, and after decreasing the learning rate by a factor of 0.1 it still converges properly, at least until the 30th epoch. After that the training loss keeps decreasing properly and the validation loss also decreases, but with small oscillations in between. I also observed that after the 30th epoch the gap between training and validation loss widens.
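For reference, the schedule described above can be sketched in plain Python. The base/max learning rates, cycle length, and the post-decay starting rate are assumptions for illustration, not the values actually used:

```python
def lr_at_epoch(epoch, base_lr=1e-3, max_lr=1e-2,
                cycle_len=10, cyclic_epochs=20, decay_every=10, decay=0.1):
    """Sketch of the schedule described above: a triangular cyclic LR for
    the first `cyclic_epochs` epochs, then a step decay by `decay`
    every `decay_every` epochs. All numeric defaults are assumed."""
    if epoch < cyclic_epochs:
        # Triangular wave oscillating between base_lr and max_lr.
        pos = epoch % cycle_len
        frac = 1.0 - abs(2.0 * pos / cycle_len - 1.0)
        return base_lr + (max_lr - base_lr) * frac
    # Step-decay phase: multiply by `decay` once per `decay_every` epochs.
    steps = (epoch - cyclic_epochs) // decay_every + 1
    return base_lr * decay ** steps
```

In PyTorch the same effect is usually obtained with `torch.optim.lr_scheduler.CyclicLR` for the first phase and `StepLR` afterwards.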

I don't understand why this is happening, because both losses are converging, but with a large difference between them. I tried tuning all the hyperparameters, especially the learning rate, but the issue persists.
Can someone please tell me where I am going wrong? A better solution for this issue is also welcome.

Below are the losses I observed while training:

Epoch: 20/50..  Training Loss: 1.402..  Validation Loss: 1.432..  Training Accuracy:  48.128..  Validation Accuracy: 47.815
Epoch: 21/50..  Training Loss: 1.359..  Validation Loss: 1.398..  Training Accuracy:  49.881..  Validation Accuracy: 49.033
Epoch: 22/50..  Training Loss: 1.331..  Validation Loss: 1.387..  Training Accuracy:  50.989..  Validation Accuracy: 49.582
Epoch: 23/50..  Training Loss: 1.305..  Validation Loss: 1.360..  Training Accuracy:  52.486..  Validation Accuracy: 50.515
Epoch: 24/50..  Training Loss: 1.273..  Validation Loss: 1.354..  Training Accuracy:  53.513..  Validation Accuracy: 50.735
Epoch: 25/50..  Training Loss: 1.244..  Validation Loss: 1.358..  Training Accuracy:  54.662..  Validation Accuracy: 51.000
Epoch: 26/50..  Training Loss: 1.209..  Validation Loss: 1.346..  Training Accuracy:  56.040..  Validation Accuracy: 51.047
Epoch: 27/50..  Training Loss: 1.184..  Validation Loss: 1.297..  Training Accuracy:  57.012..  Validation Accuracy: 53.435
Epoch: 28/50..  Training Loss: 1.154..  Validation Loss: 1.321..  Training Accuracy:  58.051..  Validation Accuracy: 51.720
Epoch: 29/50..  Training Loss: 1.114..  Validation Loss: 1.330..  Training Accuracy:  59.681..  Validation Accuracy: 52.743
Epoch: 30/50..  Training Loss: 1.084..  Validation Loss: 1.291..  Training Accuracy:  61.053..  Validation Accuracy: 52.894
Epoch: 31/50..  Training Loss: 1.052..  Validation Loss: 1.273..  Training Accuracy:  62.034..  Validation Accuracy: 54.937
Epoch: 32/50..  Training Loss: 1.018..  Validation Loss: 1.362..  Training Accuracy:  63.522..  Validation Accuracy: 52.836
Epoch: 33/50..  Training Loss: 0.980..  Validation Loss: 1.217..  Training Accuracy:  64.871..  Validation Accuracy: 57.392
Epoch: 34/50..  Training Loss: 0.940..  Validation Loss: 1.290..  Training Accuracy:  66.247..  Validation Accuracy: 54.176
Epoch: 35/50..  Training Loss: 0.904..  Validation Loss: 1.237..  Training Accuracy:  67.798..  Validation Accuracy: 57.181
Epoch: 36/50..  Training Loss: 0.858..  Validation Loss: 1.265..  Training Accuracy:  69.469..  Validation Accuracy: 56.388
Epoch: 37/50..  Training Loss: 0.824..  Validation Loss: 1.266..  Training Accuracy:  70.768..  Validation Accuracy: 56.755
Epoch: 39/50..  Training Loss: 0.743..  Validation Loss: 1.213..  Training Accuracy:  73.805..  Validation Accuracy: 58.673

From the above logs we can see that at the 39th epoch the training loss is 0.743, but the validation loss is much higher, which is why the validation accuracy is also much lower.
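To make the widening gap concrete, here is the validation-minus-training loss gap computed from a few epochs taken from the log above:

```python
# (training loss, validation loss) pairs copied from the log above
log = {20: (1.402, 1.432), 30: (1.084, 1.291), 39: (0.743, 1.213)}

# The gap roughly grows from 0.03 at epoch 20 to 0.47 at epoch 39.
gaps = {epoch: round(va - tr, 3) for epoch, (tr, va) in log.items()}
print(gaps)
```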

Hi @ashi,
This behaviour of ML models is expected. Notice that even though there is a gap between the validation loss and the training loss, the general trend in both is downward. The validation loss curve will also be noisier than the training loss curve.

Possible steps to tackle the problem:

  1. Train for more epochs. This gives the model more time to search for a better minimum instead of settling for a poor local minimum.
  2. Increase the model capacity. Over-parameterization, as discussed here, may help in achieving a lower validation loss.

Hope this helps!

I tried training for 70 epochs and the model reaches almost 90% training accuracy, but the validation accuracy only gets to about 70%.
But yes, I can try increasing the model capacity, hoping that it will help.
