When resuming, model.train accuracy is virtually 0 but model.eval accuracy is consistent (same for losses)

Hi,

I have a problem where I get 0 accuracy (and a very large loss) in training after resuming from a checkpoint.
I managed to narrow down where it's coming from.
When I resume for evaluation and call model.eval() before processing the data, I get good accuracy.
But when I resume before training and call model.train(), my accuracy drops to 0.
I use a normal batch size, identical between train and eval, and I process the same data in the same way. The only difference is the mode: with model.train() the loss is really high, but with model.eval() the loss is low.
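
For what it's worth, this is a minimal sketch of the comparison I'm doing (model, criterion, loader and device stand in for my actual setup):

```python
import torch

# Minimal sketch of the check: no backward pass, no optimizer step.
# Note: in train() mode a forward pass still updates batchnorm running
# statistics, even under no_grad().
def average_loss(model, criterion, loader, device, train_mode):
    model.train() if train_mode else model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():  # forward passes only
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            total += criterion(outputs, targets).item()
            batches += 1
    return total / max(batches, 1)

# Same data, same code; only the mode changes:
# average_loss(model, criterion, loader, device, train_mode=True)   -> huge loss
# average_loss(model, criterion, loader, device, train_mode=False)  -> low loss
```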

Do you know how I could solve this?
Thanks for your answers.

Could it be that you save/load only the model or model.state_dict()?
For resuming training, you should save and load a full checkpoint.

If so, please refer to this and/or this tutorial.
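
Something along these lines (a generic sketch in the spirit of the tutorials, with model, optimizer, epoch and the path as placeholders):

```python
import torch

# Save everything needed to resume training, not just the weights.
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')

# Load and restore both model and optimizer state before resuming.
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
```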

Maybe I wasn't precise enough. It's not that my accuracy slowly drops to 0, or that the training itself is going wrong.
In that case it could indeed be a problem with the optimizer state dict, etc. (I already checked that possibility.)

The problem is that, even without training, without any loss.backward() or optimizer.step(), the loss already indicates that my model is garbage when it's configured with model.train(). But everything is fine when I use model.eval(), with the exact same code, dataloader, etc.
It's as if the train() method were making my model completely useless.

It's not the training procedure: I don't even have to run it to already see the big difference in loss between train() and eval(). I'm doing fp32 training. I know that with misconfigured fp16 training you can get bad eval() and good train() results, but here it's the opposite.
Is there a way to configure the model under model.train() the same way it's configured when I call model.eval()? Like manually re-modifying the batchnorm layers, etc.?
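Something like this is what I have in mind: call model.train() for everything else, but push the batchnorm modules back into eval mode so they keep using their saved running statistics (rough sketch, the helper name is mine):

```python
import torch.nn as nn

def train_with_frozen_batchnorm(model):
    """Train mode for the whole model, but batchnorm layers stay in eval mode
    so they use their stored running statistics instead of batch statistics."""
    model.train()
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
    return model
```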

I am sorry I misunderstood your question at first.

But with the information given, I can't say anything more.
I also tried to recreate your problem but couldn't. Everything works fine on my end.
There is no noticeably large difference in loss between using model.eval() and model.train() after loading.

Those are also my results for many other models.
But I don't know why this one has the problem; if someone has run into it before, their input could help.
It would be hard for me to give a minimal reproducible example since I'm using other libraries. Maybe it's because I'm using deformable convolutions, but no one has reported this issue.

I’ll keep looking into it anyway

I don't know if this helps, but if I run the standard training procedure with model.eval() instead of model.train(), everything goes as expected. No accuracy drop whatsoever.

One more piece of information: even though training starts back from 0, it gets close to its saved score really fast. So I guess the convolution layers are fine, and it's as if I had to retrain the batchnorm statistics from scratch.
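
To check that suspicion, I'm comparing the batchnorm running statistics stored in the checkpoint with what the loaded model actually contains (rough sketch; the checkpoint path and state_dict key are from my own setup):

```python
import torch
import torch.nn as nn

# model is the network after loading the checkpoint.
state = torch.load('checkpoint.pth')['model_state_dict']
for name, module in model.named_modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        saved_mean = state[name + '.running_mean']
        saved_var = state[name + '.running_var']
        if not torch.allclose(saved_mean, module.running_mean) \
                or not torch.allclose(saved_var, module.running_var):
            print(f'{name}: running stats differ from the checkpoint')
```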

Hi,
I'm having exactly the same issue.
Did you find the reason for this?

No. You can restart training with the old weights; since some of them are still good, the model will progress quickly, and you can then see whether it happens again. It probably won't.
I had this problem, but I don't have it anymore.