Performance highly degraded when eval() is activated in the test phase

Hey Simon,

I’m seeing this exact same issue, and I’m using PyTorch 1.0rc, so I doubt it comes from a bug in an old version. My model is the PyTorch DenseNet implementation found at: In the model, the BN layers are generated in a loop, but they do have overlapping names (norm1 on line 22 and norm2 on line 26). I’m not that familiar with generating BN layers like this; could this indicate the “re-use” issue you are talking about?

I am also using small batch sizes (because my input is huge), so a poor running-average estimate is another possibility.



Are you using the exact same implementation as densenet? If so, “reusing” is not a problem because each norm1 is assigned to a different module object.


Thanks @mohsen, setting track_running_stats to False solved my problem.

I ran into the same problem when using PSP-Net, which includes batch norm and dropout.
Should I use it without eval() for prediction?

I have also met this issue. When I disabled cuDNN for the BN layers, the problem was solved.
You can try it. One example is at


I want to share my experience with this problem. I’m doing video deep learning (gesture detection), which is particularly demanding in terms of memory, at least on my private dataset.

I first encountered this problem in July 2018 (see the thread “Conflict between model.eval() and .train() with multiprocess training and evaluation”).
I usually have a batch size below 10, and that’s using multiple GPUs, so more like 3-5 examples per GPU.

I kept BN in training mode during validation (by calling model.train()) to get a realistic view of my validation metrics. But of course this leads to very poor results when I actually want to use my model on real-world data.

So what exactly is happening? Your model has been fitted to understand small batches: it has “overfitted the batch”. This is very counter-intuitive, because we’re used to thinking that the samples in a batch are seen independently of one another, or that batch normalization is just a helper and a form of augmentation. But this view is wrong, especially with small batch sizes. If you look at the values that a normalized sample takes inside a small batch, you can see that they vary a lot from batch to batch. In other words, the standard deviation of the values that your sample takes across random batches of your dataset is high. With a larger batch size, say 32, the distribution of values that your normalized sample takes after a BN layer, across random batches, would be much narrower. I think this is a good way to picture what I mean when I say that the model overfits the batch. Your model is trained on a much wider distribution of values, and it does not cope well with the narrower subset of values it sees once model.eval() is set.
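To make the “overfitting the batch” picture concrete, here is a small simulation sketch in plain Python (no PyTorch needed). It rests on assumptions that are not from the post: a one-dimensional Gaussian “dataset”, a fixed sample value of 1.0, and a hypothetical helper called normalized_value_spread. It measures how much the normalized value of one and the same sample moves around depending on which random batch it lands in.

```python
import math
import random
import statistics


def normalized_value_spread(dataset, sample, batch_size, trials=2000, seed=0, eps=1e-5):
    """Std of the batch-normalized value of `sample` across random batches.

    Hypothetical helper for illustration only: each trial puts `sample`
    into a random batch and normalizes it with that batch's mean and
    (biased) variance, as a BN layer does in training mode.
    """
    rng = random.Random(seed)
    normalized = []
    for _ in range(trials):
        batch = [sample] + rng.sample(dataset, batch_size - 1)
        mean = statistics.fmean(batch)
        var = statistics.pvariance(batch, mu=mean)
        normalized.append((sample - mean) / math.sqrt(var + eps))
    return statistics.pstdev(normalized)


# Toy "dataset": 10,000 draws from a standard normal (an assumption).
data_rng = random.Random(42)
dataset = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

spread_small = normalized_value_spread(dataset, sample=1.0, batch_size=4)
spread_large = normalized_value_spread(dataset, sample=1.0, batch_size=32)
print(f"batch size 4:  spread = {spread_small:.3f}")
print(f"batch size 32: spread = {spread_large:.3f}")
```

Under these assumptions the spread at batch size 4 should come out clearly larger than at batch size 32, which is the widened distribution the layers after each BN get trained on.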

I’ve tested a few solutions, some of them outlined in previous answers.

  • Perform a few forward passes after training with big batch sizes, without gradient descent, and with model.train() set. This WON’T work. Naturally, if you do that, the running statistics of your BN layers will change while your other layers stay frozen. But the problem isn’t that the BN statistics (mean and std) are wrong. It is that the actual mean and std of the dataset are bad for your network: your convolutions have been trained in a way that requires them to see extreme values. Tuning your BNs this way will produce effects akin to making your network see purely grey images. It will be hard for it to pick out anything salient at any step of the forward pass.

  • Increase the momentum of the BN, meaning that the “learned” means and stds are much more stable during training. (In PyTorch’s convention, where momentum is the weight given to the newest batch statistic, this corresponds to lowering the momentum argument.) With the same reasoning, you can understand why this won’t work. The training still sees the same widened distribution. And if you set your BNs so that during training they have a better chance of capturing the real means and stds of the dataset, they will capture values that are not suited to your convolutions!

  • Skipping batches so that you artificially have a higher batch size. This is especially wrong, because, exactly as before, you will just have a higher chance of capturing the real means and stds of the dataset. What is especially false is the assumption that the forward and backward passes then behave differently: each forward pass still normalizes over the same small batch. With a higher batch size you can hope to reproduce the results of this article, but that is a matter of convergence, not of BN. With this tactic, your model will have a different approach to overfitting the batch (with less stochastic variability), but this will still be its goal.

  • Increasing your memory by adding GPUs to your training setup. This is wrong and wasteful. The standard way of doing BN in most frameworks is “GPU specific”: the BN batch mean and std are computed over the examples sitting on an individual GPU. That means that with two GPUs, at forward time, you actually have two batch means and two batch stds at each BN layer. Consequently, thinking that 4 V100s will solve the problem you had with one is a really bad strategy for you and your wallet. What would work is a GPU so large that it can fit 32 samples in its own memory. For my problem, even a 32GB Tesla V100 didn’t cut it (it brought me to 20 samples per GPU, which is not bad, but I could still observe the bad effects of BN). Nevertheless, depending on your situation, you should try this. Unfortunately, the only 32GB V100s I could use are the ones on the p3dn.16xlarge of AWS, which has 8 of them and is especially costly (and you won’t be able to keep it for more than 1 hour in spot mode).

  • Using group normalization instead of batch norm. As advertised in this article, this WORKS! BUT it has some disadvantages, not the least of which is that NO ONE WHO RELEASES PRETRAINED MODELS USES IT! It sounds exaggerated, but this is really a hurdle for me, because as you might know, doing video deep learning without transferring knowledge is like trying to win the Olympic 100m when you’re obese. I’m pretty much forced to perform all pretraining myself, which is really tedious when you’re benchmarking many video architectures. So, if you don’t use pretrained models, you might not care, and you should definitely do it. Yet in my experience, group norm also slows down training and demands more memory, and thus is a bit irritating. And don’t even think you can take a pretrained model using BNs and replace them with GN. It doesn’t work, and you might just as well throw away everything learned after the first BN.

  • THE SOLUTION I FOUND: BN synchronization works! It means sharing statistics between GPUs at forward time, so that only one mean and one std are computed per BN layer for the whole multi-GPU setup. With this layer, your GPUs will act as one. This still requires a cumulative GPU memory high enough to hold 32 samples per batch, so it’s still a bit tedious. Still, it’s cool to finally have a solution. Plus, you can easily transfer from a BN model with this: it won’t break, and it’s a proper transfer if you think about it!
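The difference between per-GPU statistics and synchronized statistics can be sketched in plain Python. This is only an illustration: the two “shards” stand in for the parts of one batch sitting on two GPUs, and the helper names are hypothetical. Recent PyTorch versions ship a torch.nn.SyncBatchNorm layer for distributed training; whether that matches the synchronized layer used in the post is an assumption.

```python
def mean_and_var(values):
    """Mean and biased variance, as BN computes them over a batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def per_device_stats(shards):
    """Standard multi-GPU BN: each device normalizes with its own statistics."""
    return [mean_and_var(shard) for shard in shards]


def synchronized_stats(shards):
    """Synchronized BN: one mean and variance over the whole global batch."""
    return mean_and_var([v for shard in shards for v in shard])


# One global batch of 6 activations split across 2 hypothetical devices.
shards = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]
local = per_device_stats(shards)
synced = synchronized_stats(shards)
print("per-device (mean, var):", local)
print("synchronized (mean, var):", synced)
```

Each device on its own sees a mean of 2.0 or 8.0 with a small variance, while the synchronized statistics describe the full batch of 6 values. That is why adding GPUs without synchronization does not increase the batch size that BN actually sees.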

By the way, if you use a pretrained model and encounter this BN problem, notice that you’re also using your pretrained model very poorly, because it has (probably) been trained to look at very different distributions of means and stds itself. In my experience it’s still better than nothing, but it’s pretty suboptimal.

Another word on why this might not be too problematic for you. If you can use exactly the same batch size in validation as in training, make sure that your batches are exactly as random as in training, and set model.train(), you will pretty much have the best validation metrics you can get. This is especially true if your test set is not “reality” but another dataset sitting on your hard drive. It’s not a satisfying solution for me, because I want to use my model efficiently by setting the batch size to the highest value I can, optimizing my model with layer fusion, and (without going too much into details) feeding very similar data in each batch at inference. But if you don’t have these constraints, you’ll be alright.


In practical scenarios, a trained model usually accepts a single input at a time, so batch_size=1 (small). In this case, will the BN layer give unexpected results?
When you mentioned “Maybe the batch size was too small”, did you mean the batch size used for training or for testing?

This seems equivalent to executing model.train(), doesn’t it?


I finally resolved the problem: I now get the expected results when executing model.eval().
I found that I had made a typo in my code and reused a batch norm layer. That was why I got the problem.
It resulted in the same situation most people mention here.

After correcting my code, I get the expected results.
This is the snippet of the code before the correction:

def forward(self, x1, x2):
    # The blocks below all contain a BN layer.
    out1 = self.block1_0(x1)
    out2 = self.block2_0(x2)
    out1 = self.block1_1(out1)
    out2 = self.block1_1(out2)  # Typo: block2_1 should be called here.
    # Because of the typo, the BN layer inside block1_1 is reused for both branches.

I corrected the code and FINALLY, IT WORKS.
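Why a reused BN layer only hurts under eval() can be sketched with a toy running mean in plain Python. The class below is a hypothetical stand-in for the running_mean buffer of one BN layer, and the branch activations are made-up numbers. The update rule is PyTorch’s documented one with its default momentum of 0.1.

```python
class RunningMean:
    """Toy stand-in for the running_mean buffer of a single BN layer."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.value = 0.0  # PyTorch initializes running_mean to zero

    def update(self, batch):
        # new_running = (1 - momentum) * running + momentum * batch_mean
        batch_mean = sum(batch) / len(batch)
        self.value = (1 - self.momentum) * self.value + self.momentum * batch_mean
        return batch_mean


# Two branches with very different activation statistics (made-up values).
branch1 = [0.9, 1.0, 1.1]  # mean 1.0
branch2 = [8.9, 9.0, 9.1]  # mean 9.0

shared = RunningMean()  # the same BN layer called from both branches
for _ in range(200):
    shared.update(branch1)
    shared.update(branch2)

print(f"shared running mean: {shared.value:.2f}")
```

In training mode each call normalizes with its own batch mean, so both branches look fine. In eval mode both branches are normalized with the single shared running mean, which here settles between 1.0 and 9.0 and therefore matches neither branch.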


This worked for me too even with BatchNorm1d layers. One can do type(child[ii]).__name__.startswith('BatchNorm') to cover all cases.

I adopted your code, but it says that object of type “**” has no len()

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):

But if the batch size is 1, interpolate raises an error.


Great finding, setting track_running_stats = False fixed my code, thanks!


Note: there is a typo in @Wendell_Philips code sample above (runing instead of running), which will cause it to silently do nothing.

It should be:

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
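The body of the loop is cut off in the snippet above. As a minimal sketch, assuming the intent is the track_running_stats fix mentioned earlier in the thread, such a loop could look like the following. The function name is hypothetical, and the tuple of classes is one way to cover BatchNorm1d as well. On recent PyTorch versions the running buffers also need to be set to None; otherwise eval() keeps using them.

```python
import torch
from torch import nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def use_batch_stats_in_eval(model: nn.Module) -> int:
    """Make BN layers normalize with batch statistics even under eval().

    Hypothetical helper, not the original poster's code. Returns the
    number of BN layers that were changed.
    """
    changed = 0
    for m in model.modules():
        if isinstance(m, BN_TYPES):
            m.track_running_stats = False
            # Without stored statistics the layer falls back to batch statistics.
            m.running_mean = None
            m.running_var = None
            changed += 1
    return changed


# Small made-up model to try it on.
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.BatchNorm2d(4), nn.ReLU())
num_changed = use_batch_stats_in_eval(model)
model.eval()

x = torch.randn(8, 3, 8, 8) * 5 + 2
with torch.no_grad():
    bn_out = model[1](model[0](x))
channel_means = bn_out.mean(dim=(0, 2, 3))
print(num_changed, channel_means)
```

For the BN layers this behaves like staying in train mode, while dropout and other mode-dependent layers still follow eval(). It also means the output for one sample depends on the other samples in its batch, and a batch of a single value per channel cannot be normalized this way.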

I met the same problem but the reason is different from all above.

I train the model in half precision to save memory, and a few values of running_var (about 90,000) are beyond the range of float16 (maximum about 65,504). As a result, some running_var entries become inf and yield wrong results when running in eval mode. In training mode, on the other hand, the precision seems normal, because only a few batches yield inf in running_var.

Solution: use float32 for the BN layers even when training in half precision (float16). You may wish to see here for code.
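The overflow itself is easy to check without PyTorch. The sketch below uses the half-precision format of Python’s struct module to test whether a value is representable in float16. The helper name is hypothetical, and the values are the approximate ones quoted above.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite float16 value


def fits_in_float16(value: float) -> bool:
    """Return True if `value` can be stored as a finite float16."""
    try:
        struct.pack("<e", value)  # "e" is the IEEE 754 half-precision format
        return True
    except OverflowError:
        return False


for candidate in (60_000.0, FLOAT16_MAX, 90_000.0):
    print(candidate, fits_in_float16(candidate))
```

A running variance around 90,000 therefore cannot be stored in a float16 buffer, which is consistent with the inf values described above.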

If you think you haven’t made any of the above mistakes, you should check the data distribution (mean, variance, etc.) of the batches coming from your test dataloader and your train dataloader. BatchNorm will perform badly in .eval() mode if the data distributions of the training set and the test set are very different.
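A quick way to run this check is to accumulate the mean and variance over all batches of each loader and compare them. The sketch below is plain Python on made-up lists of numbers standing in for flattened batches. With real dataloaders you would flatten each batch first, and the helper name is hypothetical.

```python
def summarize(batches):
    """Return (mean, variance) over all values in an iterable of flat batches."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for batch in batches:
        for value in batch:
            count += 1
            total += value
            total_sq += value * value
    mean = total / count
    variance = total_sq / count - mean * mean  # biased variance, as in BN
    return mean, variance


# Made-up stand-ins for batches from the train and test dataloaders.
train_batches = [[0.0, 1.0], [2.0, 3.0]]
test_batches = [[10.0, 11.0], [12.0, 13.0]]

train_mean, train_var = summarize(train_batches)
test_mean, test_var = summarize(test_batches)
print("train:", train_mean, train_var)
print("test: ", test_mean, test_var)
```

A large gap between the two means or variances, as in this toy data, suggests that the running statistics learned in training will not fit the test inputs.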

it solves my problem thx a lot!

But this operation is equivalent to removing model.eval(), isn’t it?


Have you solved the problem?
I got the same problem in my experiment.

Thanks for your detailed reply. As I understand it, what really causes the problem is the difference in batch size between training and testing when training with a small batch size. I want to know: if I use model.eval() and the same batch size at test time as in training, will the problem be solved (leaving aside the case where the batch size has to be one at test time)?