Model.eval() gives incorrect loss for model with batchnorm layers

I tried to train a model with batchnorm layers. During training, I set model.train(). Every 100 iterations, I set model.eval() and validate the accuracy. However, the validation result is not correct. I don’t think this is due to overfitting, because even if I test on the same images used for training, the testing loss is quite different from the training loss. Also, if I keep model.train() set during testing, the testing loss is correct. But that usage does not make sense, because my model contains batchnorm layers.

Below is my training code:

for epoch in range(num_epochs):
    epoch_loss = 0.0
    optimizer = lr_scheduler(optimizer, epoch)

    for iteration, data in enumerate(dataloader, 0):
        iter_index += 1
        label_patch = data['label_patch']
        residue_patch = data['residue_patch']
        stacked_patch = data['stacked_patch']
        microshift_patch = data['train_patch']
        inputs, residues, labels, microshifts = Variable(stacked_patch.type(dtype)), Variable(residue_patch.type(dtype), requires_grad=False), Variable(label_patch.type(dtype), requires_grad=False), Variable(microshift_patch.type(dtype), requires_grad=False)

        # set model to train mode (before zero grad)
        model.train()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        outputs = model(inputs)
        loss = criterion(outputs, residues)

        # backward + optimize only if in training phase
        loss.backward()
        optimizer.step()

        # statistics
        epoch_loss += loss.data[0]

        # test the model every 100 iterations
        if iter_index % logging_iter == 0:
            loss_test, psnr_test = test_model(model)  

    # checkpoint for each epoch
    model_out_path = "checkpoints/model_epoch_{}_residue.pth".format(epoch)
    # the saved object is an assumption; the original call was truncated
    torch.save(model.state_dict(), model_out_path)

The testing code, which is called every 100 training iterations, is as follows:

def test_model(model):
    psnr_test_avg = 0
    loss_test_avg = 0
    for iteration, test_data in enumerate(dataloader_test, 0):
        label_test = test_data['label_patch']
        residue_test = test_data['residue_patch']
        stacked_test = test_data['stacked_patch']
        microshift_test = test_data['train_patch']
        inputs_test, residues_test, labels_test, microshifts_test = Variable(stacked_test.type(dtype), requires_grad=False), Variable(residue_test.type(dtype), requires_grad=False), Variable(label_test.type(dtype), requires_grad=False), Variable(microshift_test.type(dtype), requires_grad=False)
        outputs_test = model(inputs_test)
        loss_mse_test = criterion_mse(outputs_test + microshifts_test, labels_test).data.cpu().numpy()
        loss_l1_test = criterion(outputs_test, residues_test).data.cpu().numpy()
        psnr_test = 10 * np.log10(255 * 255 / loss_mse_test)
        loss_test_avg += loss_l1_test
        psnr_test_avg += psnr_test

    loss_test_avg /= (iteration + 1)
    psnr_test_avg /= (iteration + 1)

    return loss_test_avg, psnr_test_avg
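
For readers following along, here is a minimal sketch of why the two modes can disagree even on the same input. In train mode, BatchNorm normalizes with the statistics of the current batch; in eval mode, it uses the stored running_mean and running_var. The toy layer and data below are assumptions for illustration, not the poster's network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy layer and data, not the poster's model.
bn = nn.BatchNorm1d(4)                  # default momentum is 0.1
x = torch.randn(8, 4) * 5.0 + 10.0      # activations far from mean 0 / var 1

bn.train()
out_train = bn(x)                       # normalizes with this batch's statistics
                                        # and updates the running buffers once

bn.eval()
with torch.no_grad():
    out_eval = bn(x)                    # normalizes with running_mean / running_var

# After a single update the running buffers are still close to their
# initial values (0 and 1), so the two outputs differ noticeably.
gap = (out_train - out_eval).abs().max().item()
print("max |train - eval| output gap:", gap)
```

If the running buffers have not caught up with the real activation statistics, the same gap shows up as a loss mismatch between the two modes.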

It is possible that your training in general is unstable, so BatchNorm’s running_mean and running_var don’t represent the true batch statistics.

Try the following:

  • Change the momentum term in the BatchNorm constructor to a higher value.
  • Before you set model.eval(), run a few inputs through the model (just a forward pass, you don’t need to backward). This will help stabilize the running_mean / running_var values.
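
A rough sketch of both suggestions, assuming a hypothetical toy model and a helper named refresh_bn_stats (neither is from the original post). In PyTorch the update is running = (1 - momentum) * running + momentum * batch_stat, so a higher momentum makes the running buffers follow recent batches more closely. The value 0.5 is only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model; momentum=0.5 is an illustrative value, not a recommendation.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4, momentum=0.5))

def refresh_bn_stats(model, batches):
    """Forward-only passes in train mode so BatchNorm updates its running buffers."""
    was_training = model.training
    model.train()
    with torch.no_grad():               # forward only, no graph is built
        for batch in batches:
            model(batch)
    model.train(was_training)           # restore the previous mode

before = model[1].running_mean.clone()
warmup = [torch.randn(16, 4) + 3.0 for _ in range(10)]   # stand-in for real inputs
refresh_bn_stats(model, warmup)
after = model[1].running_mean.clone()

model.eval()                            # now switch to eval for validation
print("running_mean moved by:", (after - before).abs().max().item())
```

In a real training loop the warm-up batches would come from the training dataloader rather than random tensors.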

Hope this helps.


Thanks for your reply. I tried both suggestions but still get the error. I found that, with the same code, the model.eval() results are sometimes correct and sometimes incorrect. I will investigate further and post an update if I find a solution.

Same problem here with the latest 0.3.0 release. Have you found a solution? @zhangboknight
Changing the momentum doesn’t solve the problem @smth

I’m having the same issue. Really a bummer to have to use train mode for validation/testing.

I also have the same problem and haven’t figured out the reason.

Any update regarding this problem? I already posted the same question, and it seems that many people are facing the same problem. Could the PyTorch community respond to it?

I have the same problem.

I’m trying to load Caffe weights into a PyTorch model with batchnorm layers. Each time I load the weights from the caffemodel file, the result for the same input is different, even in eval mode.

I’m actually updating the running_mean and running_var from the caffemodel weights, so there shouldn’t be any issue with bad running_means during inference.

@meetshah1995 The meaning of Caffe’s running_mean might be different from PyTorch’s running_mean.
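
One possible difference, stated here as an assumption rather than something confirmed in this thread: as far as I know, Caffe's BatchNorm layer stores accumulated sums for the mean and variance plus a scalar scale factor in a third blob, and the actual statistics are obtained by dividing by that factor. A sketch of the conversion with a hypothetical helper and made-up blob values:

```python
def caffe_bn_blobs_to_running_stats(mean_blob, var_blob, scale_factor):
    """Convert Caffe-style accumulated BatchNorm blobs to plain mean/variance.

    Assumption: mean_blob and var_blob hold accumulated sums, and scale_factor
    is the scalar stored in the layer's third blob.
    """
    if scale_factor == 0:
        # Nothing accumulated yet; fall back to zeros.
        return [0.0 for _ in mean_blob], [0.0 for _ in var_blob]
    running_mean = [m / scale_factor for m in mean_blob]
    running_var = [v / scale_factor for v in var_blob]
    return running_mean, running_var

# Hypothetical blob values for a two-channel layer.
mean, var = caffe_bn_blobs_to_running_stats([20.0, -10.0], [40.0, 5.0], 10.0)
print(mean, var)
```

If the raw blobs are copied into running_mean and running_var without this division, eval-mode outputs would be off even though the weights look loaded.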

@falmasri I replied above in this thread (Model.eval() gives incorrect loss for model with batchnorm layers) with a working answer.

It’s not a problem in the sense that it’s not a software bug.

It’s a problem in the sense that if your training is non-stationary, you will see this behavior unless you adjust the momentum term of the BatchNorm. We set the momentum to 0.1 because it was sufficient for most of the workloads that we use. Play around with it.

@smth Agreed that running_mean may mean different things in Caffe and PyTorch. However, in eval mode, I would expect only these five things (running_mean, running_var, weight, bias, eps) to affect the final output.
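
The point about the five quantities can be checked with a small sketch: in eval mode, the layer output should match (x - running_mean) / sqrt(running_var + eps) * weight + bias computed by hand. The buffer and parameter values below are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3)
# Fill buffers and affine parameters with arbitrary (hypothetical) values.
with torch.no_grad():
    bn.running_mean.copy_(torch.randn(3))
    bn.running_var.copy_(torch.rand(3) + 0.5)
    bn.weight.copy_(torch.randn(3))
    bn.bias.copy_(torch.randn(3))
bn.eval()

def per_channel(t):
    """Reshape a per-channel vector so it broadcasts over (N, C, H, W)."""
    return t.view(1, -1, 1, 1)

x = torch.randn(2, 3, 5, 5)
with torch.no_grad():
    y_layer = bn(x)
    # Manual eval-mode computation from the five quantities.
    y_manual = (x - per_channel(bn.running_mean)) / torch.sqrt(
        per_channel(bn.running_var) + bn.eps
    ) * per_channel(bn.weight) + per_channel(bn.bias)

print(torch.allclose(y_layer, y_manual, atol=1e-5))
```

If a converted model fails this kind of check layer by layer, the mismatch is in the loaded values rather than in eval mode itself.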

I’ll try to see if I can come up with a minimal reproducible example for this.

@smth What do you mean by non-stationary training?

@falmasri It means that the statistics of the activations change rapidly during training, so that the running_mean and running_var statistics kept by BatchNorm at a momentum of 0.1 are no longer valid.
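
To make the lag concrete, here is a small pure-Python sketch of the BatchNorm-style running update applied to a drifting batch mean. The drift pattern (the batch mean grows by 1.0 per step) and the function name are made up for illustration.

```python
def track_running_mean(batch_means, momentum, initial=0.0):
    """Apply the update running = (1 - m) * running + m * batch_mean."""
    running = initial
    history = []
    for batch_mean in batch_means:
        running = (1.0 - momentum) * running + momentum * batch_mean
        history.append(running)
    return history

# Hypothetical drifting activations: the batch mean grows by 1.0 every step.
batch_means = [float(step) for step in range(1, 51)]

lag_slow = batch_means[-1] - track_running_mean(batch_means, momentum=0.1)[-1]
lag_fast = batch_means[-1] - track_running_mean(batch_means, momentum=0.5)[-1]

print("lag with momentum 0.1:", round(lag_slow, 2))
print("lag with momentum 0.5:", round(lag_fast, 2))
```

With this steady drift the running mean trails the current batch mean by roughly (1 - m) / m, so about 9 steps of drift at momentum 0.1 versus about 1 at momentum 0.5.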

Is it theoretically incorrect though?

Nice, setting the momentum to 0.5 seems to make the loss calculated using model.eval() similar to the loss computed using model.train(), but only after a few epochs of differing results.

Does the second suggestion work only when we are loading the model for eval? It should not affect the loss calculation, correct or incorrect, if we are training the network and running validation every nth step.

I also met this problem in my project. In short, downgrading PyTorch to version 0.1.12 resolves the problem. But I really don’t know what changed in the BN implementation between 0.1.12 and the later versions …

I replied on the issue, but running stats are inherently unstable when the batch size is only 1.
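
A quick pure-Python illustration of why small batches give noisy statistics, using synthetic scalar samples as a stand-in for activations (an assumption; in a conv layer the statistics are also pooled over spatial positions, so the effect there is milder):

```python
import random
import statistics

random.seed(0)

def batch_mean_spread(batch_size, num_batches=2000):
    """Standard deviation of per-batch means for synthetic unit-variance data."""
    means = []
    for _ in range(num_batches):
        batch = [random.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

spread_1 = batch_mean_spread(1)
spread_32 = batch_mean_spread(32)
print("spread of batch means, batch size 1 :", round(spread_1, 3))
print("spread of batch means, batch size 32:", round(spread_32, 3))
```

The per-batch mean fluctuates much more at batch size 1, and those fluctuations feed directly into the running buffers.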

Thanks for the reply! The training batch size is 6, not 1. I have also tried a larger batch size (32) with other architectures (upsampling on ResNet18), but the bug remains. My main question is why PyTorch 0.1.12 works while >= 0.2 does not.


I think this is not about the momentum. I have the same problem. When I call


every call of model(input) gives almost the same result if it comes after model.train(), but the result differs from what I get after model.eval().