Model fails after saving and loading state_dict from earlier epochs

I was using this code to ensemble the model from several epochs, but as you can see from the kappa scores printed below, every epoch except the last one produces a very strange score:

# test_model runs inference on the validation set in model.eval() mode
# and returns the predictions and the kappa score:
# def test_model(val_loader=val_loader, val_df=val19_df, return_p=False)
global labels_v
val_preds = np.zeros(len(val19_df))
for i in range(len(Ws)):
    if Ws[i] > 0:
        model.load_state_dict(torch.load(f'weight_best{i * 4 + 5}.pt'))
        preds_i, labels_v = test_model(return_p=True)
        val_preds += Ws[i] * preds_i
val_preds /= w_total

kappa score: -0.007528865258592532
kappa score: 0.0
kappa score: 0.0
kappa score: 0.49560576349219887
kappa score: 0.9044690071390352

The exact same test_model function was used during training, and it behaved normally before saving. The models for these epochs were saved like this:

if epoch % 4 == 1:
    torch.save(model.state_dict(), f'weight_best{epoch}.pt')
    print(f'model{epoch} saved')

It also seems that if I run inference in model.train() mode, the results look more normal. So what's wrong?

I'm not sure what Ws contains. Did you make sure to call model.eval() inside test_model?
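If eval() is missing, layers like BatchNorm and Dropout keep behaving as they do in training (batch statistics, random dropping), which would also explain why model.train() "looks more normal" to you. A minimal sketch of the difference, using a standalone BatchNorm layer rather than your actual model:

```python
import torch
import torch.nn as nn

# A single BatchNorm layer is enough to show the train/eval difference.
bn = nn.BatchNorm1d(4)

# A constant batch, deliberately far from the freshly initialized
# running stats (running_mean=0, running_var=1).
x = torch.full((8, 4), 5.0)

bn.train()
out_train = bn(x)  # normalizes with the batch mean/var -> near zero

bn.eval()
out_eval = bn(x)   # uses the running stats instead -> values stay large

print(out_train.abs().max().item())  # close to 0
print(out_eval.mean().item())        # still close to 5
```

So if test_model forgets model.eval(), a just-loaded checkpoint is evaluated with whatever batch statistics the validation batches happen to have, and the scores can collapse the way yours did.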