Worse performance when validating the model


I’m training a model that seems to be learning well during the training step. However, during the validation step the performance is terrible, as if the model hadn’t learned at all. The really low accuracy on the validation set is noticeable from the second epoch, and it keeps dropping throughout training while the training accuracy keeps going up normally.

I’m fine-tuning a MobileNetV3_small from torchvision models, either training it from scratch or starting from pretrained weights and just freezing the backbone. Both approaches have the same problem.

I’ve tried removing model.eval() and model.train() from my code, and the performance seems to be much better.
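For context, the usual pattern toggles the mode explicitly, because dropout and batch-norm layers behave differently in the two modes. A minimal sketch with a toy model and fake data (not the MobileNet setup from the question):

```python
import torch
import torch.nn as nn

# Toy model with a BN layer, plus fake data standing in for real loaders
model = nn.Sequential(
    nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 2)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(3)]

def run_epoch(model, loader):
    model.train()  # BN normalizes with batch stats and updates running stats
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader):
    model.eval()  # BN normalizes with the accumulated running stats instead
    total = 0.0
    for x, y in loader:
        total += criterion(model(x), y).item()
    return total / len(loader)

run_epoch(model, loader)
val_loss = evaluate(model, loader)
```

Removing model.eval() means validation also uses per-batch statistics, which can hide a mismatch between the running statistics and the data.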

I don’t know why it’s happening. The training and validation sets seem to be working well.

You mention:

The really low accuracy on the validation set is noticeable

as well as:

The training and validation sets seem to be working well

Could you clarify? It seems both cannot be the case.

Hey! Thanks for your reply. What I meant in the first sentence is that the accuracy during the validation step is really low; in the second, I meant that the train and val datasets seem to be correct.

Ah, I see. So if I understand correctly, the training accuracy is improving while the validation accuracy is getting worse, which looks like a standard overfitting problem. Is the training set or the validation set too small? If the training set is too small, that directly implies there aren’t enough samples to learn from; if the validation set is too small, you might end up with a val set of difficult samples that doesn’t reflect how good or bad the network actually is.

However, since you say the validation and train sets are fine, I’ll assume that’s not the case. If so, it might be a coding issue, like dissimilar transforms on train and val. To rule this out, I’d suggest using the train set as both the train and the val set. If the code is fine, you’ll see what’s expected: improving accuracy from both the train loss and the validation part of your code. Otherwise you’ll see a mismatch.
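The sanity check above can be sketched like this (the `TensorDataset` here is a dummy stand-in for the real training dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real training dataset
train_dataset = TensorDataset(
    torch.randn(100, 3, 224, 224), torch.randint(0, 10, (100,))
)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Reuse the *same* dataset (with the same transforms) as the "validation" set:
# if the validation part of the loop reports much worse accuracy on data the
# model just trained on, the bug is in the code, not in the datasets.
val_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)
```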

That’s all I could think of at the moment.


I’ve tested some things and apparently the code runs without bugs. It turns out the problem may be related to model.eval(): when I remove this line from my training functions, the loss decreases and the accuracy increases in both train and validation. The model starts to overfit after the 20th epoch.

I searched for similar problems but couldn’t find anything helpful. I find this error really weird, and I don’t know if I’m getting the correct values of loss and accuracy without the model.eval() line, though.

Thanks for your reply, again.

I once had problems with model.eval() because I was using batch normalization and couldn’t support a large enough batch size. If you are using BN, could you try putting only those layers in eval mode while training and see if train and val behave similarly?


Hello. Sorry for the delay, and thanks for your reply. I kept training and validating the network without model.eval(). However, when I run the model on the test set, the same problem happens.

I tried running the BN layers in eval mode during training, and the performance during training also degraded.

I saw some suggestions about setting track_running_stats to False, but that didn’t work for me.

Apparently it has to do with PyTorch’s BN implementation, but I still don’t understand why this problem is happening.
