Inconsistent evaluation performance when loading state dict

I am currently experimenting with a network to perform multitask classification on faces.
I evaluate the model every 50 steps (switching between model.eval() and model.train()) and obtain values like:

  "Task 1":   {"MAE": 4.9965996742248535, "RMSE": 7.064884662628174},
  "Task 2":   {"ACC": 0.9571999907493591}

After training is done, I save the model’s state dict by calling: torch.save(model.state_dict(), os.path.join(PATH, 'model-{}.pt'.format(epoch)))

The problem is that when I load the model in a Jupyter notebook for further testing and call the evaluation function exactly as in the training script, I get these results:

"Task 1":   {'MAE': 10.82919979095459, 'RMSE': 13.656017303466797},
"Task 2":   {'ACC': 0.5248000025749207}

Which looks to me as if the loaded weights were never trained at all.
I tried training again multiple times and the results are the same. As a hail mary, I tried saving the entire model (and not just the state dict) and to my surprise it worked! The evaluation results were consistent.
Does anyone have any idea of what I’m doing wrong, and how I can get good results from the state dict so I don’t rely on the entire saved model? Thanks a lot!

Are you setting the model to .eval() after loading the state_dict?
Could you post a code snippet to reproduce this issue?

Yes, I’ve been calling model.eval() both during training and in the subsequent testing. I’m afraid that code snippets to reproduce the issue itself are somewhat problematic, given that I’m not at liberty to disclose my model at this moment.

I understand that this makes the problem somewhat abstract, but any insight into what might be happening would be much appreciated. I suspect it might have something to do with the fact that I’m training with transfer learning: I load pretrained weights for my backbone, and that is somehow not being carried over to my saved state_dict.

I get that you are not “at liberty to disclose the model”, but a minimal code snippet does not disclose anything, as it is usually quite standard. It might even help you pinpoint the source of the issue, since you have to remove anything unnecessary or application-specific. So first step: try to reproduce the error with a minimal script (outside of a Jupyter notebook, where you might miss some hidden state).

With that being said, there are a few things you wanna check:

  • Calling eval() after load_state_dict: you say you call eval() during training? You should be calling train() while training and eval() while evaluating.
  • Are you doing the final test on a different set?
  • You mention a backbone: when saving the state_dict, are you saving only the backbone’s state_dict, or the backbone and the last layer/module (which I’m guessing exists, since you mention a backbone)? Saving only the backbone would explain why it works when you save the whole model.
  • Try setting the torch.backends.cudnn.deterministic flag, at least for evaluation.
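To make the checklist concrete, here is a minimal sketch of a save/load round trip that could serve as a starting point for a reproduction script. TinyNet is a hypothetical stand-in for the real network (a small backbone plus a head), and the checkpoint is written to an in-memory buffer rather than to the path used in the original post; both are assumptions for illustration only.

```python
import io

import torch
import torch.nn as nn

# Ask cuDNN for deterministic kernels (harmless on CPU-only machines).
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False


class TinyNet(nn.Module):
    """Hypothetical model: a backbone with BatchNorm plus a task head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU()
        )
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        return self.head(self.backbone(x))


torch.manual_seed(0)
model = TinyNet()

# A few forward passes in train mode so the BatchNorm running stats change.
model.train()
for _ in range(5):
    model(torch.randn(32, 8))

# Save only the state dict (an in-memory buffer stands in for a file path).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Rebuild the architecture from scratch and load the saved weights.
restored = TinyNet()
result = restored.load_state_dict(torch.load(buffer))  # strict=True by default

# Both models must be in eval mode before comparing outputs.
model.eval()
restored.eval()

x = torch.randn(4, 8)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))

print(result)
print("outputs match:", same)
```

If the outputs match in a script like this but not in the real project, the difference is likely in the model construction, the data pipeline, or the notebook state rather than in torch.save / load_state_dict themselves.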

Hope this helps!

I understand your point about the code snippet. I’ll try to come up with it, if nothing else to gain some more insight about what might be causing my problem exactly.

  • I meant to say that I call .eval() in the training script, when evaluating the model against the validation set. During training itself the model is in training mode.
  • Yes. The data is properly split.
  • Weights for both the backbone and the head module are being saved/loaded in the state dictionary. Just confirmed it.
  • Thanks for the suggestion, I will try that in future experiments on the issue.
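For anyone wanting to verify the third point programmatically, one possible check is to load the checkpoint with strict=False and inspect which keys were reported as missing or unexpected. The sketch below simulates a checkpoint that contains only backbone weights; the two-layer ModuleDict and the "backbone." / "head." names are hypothetical, not the actual model.

```python
import torch.nn as nn


def build_model():
    # Hypothetical architecture: one backbone layer and one head layer.
    return nn.ModuleDict(
        {"backbone": nn.Linear(8, 16), "head": nn.Linear(16, 2)}
    )


trained = build_model()

# Simulate a checkpoint that accidentally holds only the backbone weights.
partial_state = {
    name: tensor
    for name, tensor in trained.state_dict().items()
    if name.startswith("backbone.")
}

fresh = build_model()

# strict=False does not raise on mismatches; it reports them instead.
report = fresh.load_state_dict(partial_state, strict=False)

print("missing keys:", report.missing_keys)
print("unexpected keys:", report.unexpected_keys)
```

Any entry under missing keys is a parameter that keeps its random initialization after loading, which would produce exactly the "never trained" behaviour described above. With the default strict=True, the same mismatch raises an error instead of passing silently.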

In addition to @Latope2-150’s suggestions, also check for functional calls to e.g. dropout (F.dropout) that are missing the training argument. In the past, we’ve narrowed down the non-deterministic behavior in similar issues to dropout not being properly disabled in eval mode.
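A small sketch of the difference, using two hypothetical modules: F.dropout has training=True as its default, so it keeps dropping activations after model.eval() unless the module's own training flag is passed in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LeakyDropout(nn.Module):
    """Hypothetical module with the bug: dropout ignores eval mode."""

    def forward(self, x):
        # training defaults to True, so model.eval() has no effect here.
        return F.dropout(x, p=0.5)


class FixedDropout(nn.Module):
    """Same module with the module's training flag passed through."""

    def forward(self, x):
        return F.dropout(x, p=0.5, training=self.training)


x = torch.ones(1000)

leaky = LeakyDropout().eval()
fixed = FixedDropout().eval()

leaky_out = leaky(x)  # still randomly zeroed and rescaled
fixed_out = fixed(x)  # identity in eval mode

print("leaky changed input:", not torch.equal(leaky_out, x))
print("fixed changed input:", not torch.equal(fixed_out, x))
```

Using nn.Dropout as a submodule instead of the functional call avoids the problem altogether, since it follows model.train() / model.eval() automatically.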
