Model.eval() gives incorrect loss for model with batchnorm layers

(sz) #14

@smth Agreed that running_mean may mean different things in Caffe and PyTorch. However, in eval mode I would guess only these five quantities should affect the final output: (running_mean, running_var, weight, bias, eps).

I’ll try to see if I can come up with a minimal reproducible example for this.
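In the meantime, here is a minimal sketch (my own, not from the thread) that checks the claim by recomputing a `BatchNorm2d` eval-mode output from exactly those five quantities:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
bn.eval()  # eval mode: normalize with running stats, not batch stats

x = torch.randn(4, 3, 8, 8)
with torch.no_grad():
    y = bn(x)
    # recompute from (running_mean, running_var, weight, bias, eps)
    mean = bn.running_mean.view(1, -1, 1, 1)
    var = bn.running_var.view(1, -1, 1, 1)
    w = bn.weight.view(1, -1, 1, 1)
    b = bn.bias.view(1, -1, 1, 1)
    y_manual = (x - mean) / torch.sqrt(var + bn.eps) * w + b

print(torch.allclose(y, y_manual, atol=1e-6))  # True
```

If the two outputs ever diverged in eval mode, that would point to something beyond these five quantities affecting the result.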

(Falmasri) #15

@smth What do you mean by non stationary training ?


@falmasri It means the statistics of the activations change rapidly during training, such that the running_mean and running_var estimates BatchNorm accumulates with the default momentum of 0.1 are no longer valid.
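To make the momentum point concrete, a small sketch (my own, not from the thread) of the exponential-moving-average rule `running_mean ← (1 − momentum) · running_mean + momentum · batch_mean`:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, momentum=0.1)  # PyTorch's default momentum
bn.train()

x = torch.full((8, 1), 5.0)  # a batch whose mean is 5.0
with torch.no_grad():
    bn(x)

# running_mean starts at 0 and moves only 10% toward the batch mean:
# 0.9 * 0.0 + 0.1 * 5.0 = 0.5
print(bn.running_mean)  # tensor([0.5000])
```

With momentum 0.1 it takes many batches for the running estimates to catch up, which is exactly the problem when the activation statistics are non-stationary.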

(Blitzkrieg) #17

Is it theoretically incorrect though?

(Blitzkrieg) #18

Nice, setting the momentum to 0.5 seems to make the loss calculated with model.eval() similar to the loss computed with model.train(), but only after a few epochs of differing results.

Does the second suggestion work only when we are loading the model for eval? It should not affect the loss calculation, correct or incorrect, if we are training the network and running validation every nth step.
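For reference, a hedged sketch of the momentum change applied to every BatchNorm layer of an existing model (the model here is just a stand-in):

```python
import torch.nn as nn

# a stand-in model; substitute your own
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# raise the BN momentum so the running stats track recent batches more closely
for m in model.modules():
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.momentum = 0.5
```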

(Xingyi Zhou) #19

I also met this problem in my project (see my answers in the linked threads). In short, downgrading PyTorch to version 0.1.12 resolves the problem, but I really don’t know what happened to the BN implementation between 0.1.12 and the later versions …

(Simon Wang) #20

I replied on the issue, but running stats are unstable by nature when the batch size is only 1.
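A small illustration (my own sketch) of the point: with batch size 1, each update to the running buffers is driven by a single sample's statistics, so the estimates jump around:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(1)
bn.train()

means = []
for _ in range(5):
    x = torch.randn(1, 1, 4, 4)  # batch size 1
    with torch.no_grad():
        bn(x)  # each update sees only this one sample's statistics
    means.append(bn.running_mean.item())

print(means)  # a noisy trajectory rather than a smooth estimate
```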

(Xingyi Zhou) #21

Thanks for the reply! The training batch size is 6, not 1. I have also tried a larger batch size (32) with other architectures (upsampling on ResNet18), but the bug remains. My main question is why PyTorch 0.1.12 works while >= 0.2 does not.

(Youkaichao) #22

I think this is not about the momentum. I have the same problem: every call of model(input) returns almost the same output after model.train(), but the output differs from what follows model.eval().
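A minimal reproduction of that observation, assuming an arbitrary stand-in model with a BatchNorm layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
x = torch.randn(4, 10)

model.train()
with torch.no_grad():
    y_train = model(x)  # normalized with this batch's own statistics

model.eval()
with torch.no_grad():
    y_eval = model(x)   # normalized with the barely-updated running stats

print(torch.allclose(y_train, y_eval))  # False early in training
```

Early in training the running buffers are still near their initialization (mean 0, var 1), so the two modes normalize with very different statistics.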

(Félix Lessange) #23

Is there any solution to this? The solution provided in “Performance highly degraded when eval() is activated in the test phase” only partially solves the problem for me (a gain in accuracy, but not the full deal).

This is related to a problem I posted months ago, “Conflict between model.eval() and .train() with multiprocess training and evaluation”. No new solution is outlined there.

(Siyi Deng) #24

Same problem here: PyTorch 0.4.1 gives much worse results even when the evaluation is done on the training data.

(Simon Wang) #25

Of course it is different… you are updating the running statistics every time you do a forward pass in training mode, hence changing the eval behavior.
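This can be seen directly: any forward pass in train mode mutates the running buffers, even inside torch.no_grad() (a sketch of my own):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.train()

before = bn.running_mean.clone()
x = torch.randn(2, 3, 4, 4) + 3.0  # activations with mean around 3
with torch.no_grad():
    bn(x)  # even without gradients, the running stats are updated

print(torch.equal(before, bn.running_mean))  # False: eval behavior changed
```

So evaluating the training data after further training-mode forwards is not comparing like with like.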

(Lyne Tchapmi) #26

I believe this is likely a PyTorch bug rather than a model instability issue. I encounter this issue randomly when training/testing my models, and the solution has been either removing all BatchNorm layers from my model or downgrading to PyTorch 0.1.12. A reproducible example can be found in this repository for a recent CVPR18 paper. Discussion surrounding the issue can be found here. This code is consistently stable only with PyTorch 0.1.12.
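For the “remove all BatchNorm layers” workaround, one possible sketch (a hypothetical helper of my own, not from the linked repository) that swaps every BN layer for an nn.Identity:

```python
import torch.nn as nn

def strip_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace BatchNorm layers with Identity (debugging aid)."""
    for name, child in module.named_children():
        if isinstance(child, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            setattr(module, name, nn.Identity())
        else:
            strip_batchnorm(child)
    return module

model = strip_batchnorm(
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
)
print(model)  # the BatchNorm2d slot is now an Identity
```

Note this changes what the network computes, so it is only useful for isolating whether BatchNorm is the source of the train/eval gap.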

(Victor Tan) #27

Hi, I have the same problem. Have you solved it yet?


If the mean and variance of the training data are, as mentioned, non-stationary (which may arise from a small batch size), you could try nn.BatchNorm2d(out_channels, track_running_stats=False). This disables the running statistics and instead uses the current batch’s mean and variance for the normalization, I believe. It worked for me :slight_smile:
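A sketch of that suggestion: with track_running_stats=False the layer keeps no running buffers and normalizes with the current batch’s statistics even in eval mode.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8, track_running_stats=False)
print(bn.running_mean)  # None: no running buffers are kept

x = torch.randn(4, 8, 16, 16)
bn.eval()
with torch.no_grad():
    y = bn(x)  # still normalized with this batch's own mean/variance

# per-channel means of y are ~0, confirming batch statistics were used
print(y.mean(dim=(0, 2, 3)).abs().max() < 1e-4)  # True
```

The trade-off is that eval-mode outputs now depend on the composition of each evaluation batch.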

(Victor Tan) #29

Thanks a lot! Your solution may be worth a try, but I think BatchNorm may not be the best choice for some applications.

(Kieumy) #30

I like this way, it works for me. Thank you very much.


I also meet the instability issue randomly.
Sometimes model.eval() works well; sometimes it does not.
This should be a bug, since it does not occur in TensorFlow or Keras under the same settings (same network, same batch size, etc.).


Could you post a reproducible code snippet so that we could have a look?
Also, have a look at the reproducibility docs in case you haven’t seen them.

(田晋宇) #33

Replying to follow this thread.