When resuming, model.train accuracy is virtually 0 but model.eval accuracy is consistent (same for losses)


I have a problem where my training accuracy is 0 (and the loss is very high) after resuming from a checkpoint.
I managed to narrow down where it comes from.
When I resume to evaluate and call “model.eval()” before processing data, I get good accuracy.
But when I resume to train and call “model.train()”, my accuracy drops to 0.
The batch size is normal and identical between train and eval, and I’m processing the same data the same way. The only difference is the mode: “model.train()” gives a really high loss, while “model.eval()” gives a low loss.

Do you know how I could solve it?
Thanks for your answers

Could it be that you save/load only the model or model.state_dict()?
To resume training, you should save and load a full checkpoint (model and optimizer state).

If so please refer to this and/or this tutorial
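
For reference, here is a minimal sketch of what saving and loading a full checkpoint usually looks like. The tiny model, the dictionary keys ("epoch", "model_state_dict", "optimizer_state_dict") and the in-memory buffer are placeholders chosen for illustration, not the original poster's setup; in practice you would pass a file path to torch.save / torch.load.

```python
import io

import torch
import torch.nn as nn


def build():
    # Hypothetical toy model and optimizer, only here to make the sketch runnable.
    model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer


model, optimizer = build()
model.train()

# One training step so that the optimizer has some state (momentum buffers).
loss = model(torch.randn(16, 4)).sum()
loss.backward()
optimizer.step()

# Save everything needed to resume, not just the weights.
buffer = io.BytesIO()  # stands in for a file path such as "checkpoint.pt"
torch.save(
    {
        "epoch": 0,
        "model_state_dict": model.state_dict(),  # includes BatchNorm running stats
        "optimizer_state_dict": optimizer.state_dict(),
    },
    buffer,
)

# Resume: rebuild the objects, then restore their state.
buffer.seek(0)
checkpoint = torch.load(buffer)
resumed_model, resumed_optimizer = build()
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```

Note that model.state_dict() contains the BatchNorm buffers (running_mean, running_var) as well as the learnable parameters, so they are restored together with the weights.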

Maybe I wasn’t precise enough. It’s not that my accuracy slowly drops to 0, or that I’m training incorrectly.
In that case, it could be a problem with the optimizer state dict etc. (I checked that possibility.)

The problem is that even without training, without calling loss.backward() or optimizer.step(), the loss already indicates that my model is garbage when it’s in model.train() mode. But everything is OK when I use model.eval(), with the exact same code, dataloader, etc.
It’s as if the train() method were making my model completely useless.
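
The comparison described above can be sketched as follows. This is only an illustration: loss_in_both_modes is a hypothetical helper name, and the toy model and random data are assumptions, not the real setup. One detail worth knowing is that a forward pass in train mode updates the BatchNorm running statistics even under torch.no_grad(), so the sketch probes a copy of the model.

```python
import copy

import torch
import torch.nn as nn


def loss_in_both_modes(model, inputs, targets, criterion):
    """Compute the loss on one batch in eval mode and in train mode, without any backward pass."""
    # Work on a copy: a train-mode forward pass modifies BatchNorm running stats,
    # even inside torch.no_grad().
    probe = copy.deepcopy(model)
    losses = {}
    with torch.no_grad():
        for mode in ("eval", "train"):
            probe.train(mode == "train")
            losses[mode] = criterion(probe(inputs), targets).item()
    return losses


torch.manual_seed(0)
# Hypothetical toy model and data, for illustration only.
model = nn.Sequential(nn.Linear(10, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 3))
inputs = torch.randn(32, 10) * 4 + 3
targets = torch.randint(0, 3, (32,))

losses = loss_in_both_modes(model, inputs, targets, nn.CrossEntropyLoss())
print(losses)
```

If the two numbers differ a lot on a freshly loaded model, the difference has to come from layers that behave differently in the two modes, typically BatchNorm and Dropout.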

I don’t even have to run the training procedure to see the big difference in loss between train() and eval(). I’m doing fp32 training. I know that with misconfigured fp16 training you can get a bad eval() and a good train(), but here it’s the opposite.
Is there a way to make the model behave in model.train() the same way it does in model.eval()? For example, by manually reconfiguring the batchnorm layers?
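
One possible way to do this, as a sketch rather than a confirmed fix for this thread: keep the model in train mode but switch only the BatchNorm layers to eval mode, so they use their stored running statistics instead of per-batch statistics. freeze_batchnorm is a hypothetical helper name and the small model is an assumption for illustration.

```python
import torch
import torch.nn as nn


def freeze_batchnorm(model):
    """Put every BatchNorm layer in eval mode so it normalizes with its running statistics."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()
    return model


torch.manual_seed(0)
# Hypothetical toy model; the input is shifted and scaled so that batch
# statistics differ from the freshly initialized running statistics.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU())
x = torch.randn(4, 3, 16, 16) * 5 + 3

with torch.no_grad():
    model.eval()
    out_eval = model(x)

    model.train()            # the whole model goes back to train mode...
    freeze_batchnorm(model)  # ...then only the BatchNorm layers are switched to eval
    out_frozen = model(x)

print(torch.allclose(out_eval, out_frozen))
```

Keep in mind that every call to model.train() puts the BatchNorm layers back in train mode, so the helper would have to be called again after each model.train() (for example at the start of every epoch).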

I am sorry I misunderstood your question at first.

But with the information given I can’t say anything more.
I also tried to recreate your problem but couldn’t. Everything works fine on my end.
There is no noticeably large difference in loss between using model.eval() or model.train() after loading.

Those are also my results for many models.
But I don’t know why this one has this problem. If someone else has had it, their experience could help.
It would be hard for me to give a minimal reproducible example because I’m using other libraries. Maybe it’s because I’m using deformable convolutions, but no one has reported this issue.

I’ll keep looking into it anyway

I don’t know if this helps, but if I run the standard training procedure with model.eval() instead of model.train(), everything goes as expected. No accuracy drop whatsoever.

One more piece of information: although training starts back from 0, it gets close to its saved score really fast. So I guess the convolution layers are fine, and it’s as if I had to retrain the batchnorm layers from scratch.
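
One way to test the batchnorm hypothesis is to compare each layer's stored running_mean with the mean of the activations it actually receives on a batch. The sketch below uses forward hooks for that; batchnorm_stat_gaps is a hypothetical helper name, it only handles BatchNorm2d, and the toy model is an assumption. A large gap for a layer would suggest that its running statistics do not describe the data it sees, which is exactly the case where train mode and eval mode disagree.

```python
import torch
import torch.nn as nn


def batchnorm_stat_gaps(model, batch):
    """Return {layer_name: max |batch_mean - running_mean|} for each BatchNorm2d layer."""
    gaps = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            if module.running_mean is None:  # track_running_stats=False
                return
            # Per-channel mean of the input this layer receives.
            batch_mean = inputs[0].mean(dim=(0, 2, 3))
            gaps[name] = (batch_mean - module.running_mean).abs().max().item()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(make_hook(name)))

    was_training = model.training
    model.eval()  # eval mode, so the running statistics are not modified
    with torch.no_grad():
        model(batch)
    model.train(was_training)  # note: this resets the mode of all submodules
    for handle in hooks:
        handle.remove()
    return gaps


torch.manual_seed(0)
# Hypothetical toy model and data, for illustration only.
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=1), nn.BatchNorm2d(4))
batch = torch.randn(8, 3, 16, 16) + 5.0
print(batchnorm_stat_gaps(model, batch))
```

The same idea can be extended to running_var by comparing it with the per-channel variance of the input.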

I’m having exactly the same issue.
Did you find the reason for this?

No. You can restart training from the old weights. Some of them are good, so the model will progress quickly, and you can then see whether it happens again. It probably won’t.
I had this problem, but I don’t have it anymore.