I ran into a weird situation when resuming training. Say I have 10 epochs. From the 1st epoch, the training loss decreases steadily while the validation loss floats around 1.8-2.0. At the 5th epoch, the training loss is 0.95 and the validation loss is 1.7. At this point I save a checkpoint containing both the model state dict and the optimizer state dict. On the first epoch after resuming, the training loss rises a bit to 1.05, while the validation loss drops to 1.1. After that, the training loss keeps dropping, but the validation loss steadily rises back to 1.6.
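For context, here is a minimal sketch of the save/resume pattern I'm describing (the toy model, file name, and variable names here are illustrative, not my actual code):

```python
import torch
import torch.nn as nn

# Hypothetical tiny setup: a linear model, Adam, and cross-entropy loss
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step so Adam accumulates internal state (moment estimates)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()

# Save both state dicts at the end of the epoch, as described above
checkpoint = {
    "epoch": 5,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Resume: rebuild the model and optimizer, then restore both state dicts
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
ckpt = torch.load("checkpoint.pt")
resumed_model.load_state_dict(ckpt["model_state_dict"])
resumed_optimizer.load_state_dict(ckpt["optimizer_state_dict"])
resumed_model.train()  # switch back to training mode before continuing
```

After loading, the resumed model's parameters are bit-identical to the saved ones, so in principle training should pick up exactly where it left off.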
My question is: since I saved the state dicts of both the model and the optimizer, why doesn't the validation loss continue its previous trend when I resume training? I expected it to keep dropping from 1.7, but instead it drops to 1.1 and then rises back up to 1.6.
I use the Adam optimizer with cross-entropy loss. What issues could be causing this?