Performance highly degraded when eval() is activated in the test phase

I encountered the same problem here; test performance even gets worse and worse as the epochs increase. I've set model.train() before optimizer.zero_grad(), with the same results.

Hi, have you solved the problem? I'm encountering this problem too …

I have encountered the same problem and cannot understand why this happens. If @Soumith_Chintala or somebody from the community could help, that would be great!

I have also tried weight decay and dropout, treating the problem as a case of over-fitting, yet it persists.

Hi, have you solved this problem? My model overfits, and using model.eval() with dropout is not helping.

I’m getting this with smaller minibatches, especially when training on several GPUs.

Same problem here. Hope someone can answer.

I also ran into the same problem; in my case the model uses batch norm heavily in different layers. I poked around the code, and I think the problem might come from track_running_stats (line 64 in this file: ).

I solved the problem by setting track_running_stats=False for all batch norm layers in the model. I think this is due to a bug in line 64 of the above-mentioned file.

If you want the model to work in eval mode, set track_running_stats to False for all batch norm layers:

# With track_running_stats=False, the layer normalizes with the current
# batch statistics even in eval mode, instead of the running averages.
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.track_running_stats = False

Guys, this is not a bug. You are getting unstable estimates with a small batch size. This is natural.


How do you explain it not working when eval() is set, then?

Exactly what I said above: with a small batch size, the running average estimate is unstable, so you get worse results when using it.
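To make the "unstable estimate" point concrete, here is a minimal sketch (with made-up stand-in data, not from this thread) of how batch size affects the per-batch mean estimates that BatchNorm folds into its running average via an exponential moving average (PyTorch's default momentum is 0.1):

```python
import torch

# PyTorch updates running statistics roughly as:
#   running_mean <- (1 - momentum) * running_mean + momentum * batch_mean
# The spread of per-batch means shrinks like 1/sqrt(batch_size), so tiny
# batches feed the running average very noisy estimates.
torch.manual_seed(0)
data = torch.randn(8192)  # stand-in activations: true mean 0, std 1

def batch_mean_spread(batch_size):
    """Std of the per-batch mean estimates BatchNorm would see."""
    means = torch.stack([chunk.mean() for chunk in data.split(batch_size)])
    return means.std().item()

print(batch_mean_spread(2))    # noisy, roughly 1/sqrt(2)
print(batch_mean_spread(512))  # much tighter, roughly 1/sqrt(512)
```

With batch size 2 the spread is over an order of magnitude larger than with 512, which is why eval-mode statistics computed from tiny batches can be far off.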

I have got the same exact problem with a 1d convolutional network in PyTorch 0.4.1. The batch size is 128, so it shouldn’t be connected to a small batch size. Disabling eval() fixes the problem.

Removing batchnorm fixes the problem, so batchnorm must be the problem. But would this be an issue once you save the model, load it, and use it in production, where the batch size can be 1?


Another common mistake that I’ve seen is to re-use the same BN layer in different places of the network. Is it possible that this is the reason? If not, could you share the model definition with us?


OP replied to this that even with larger batch size and smaller momentum, the problem persists. I have the same problem.

I’m using this code to do gesture recognition


Why don’t you just remove the batch norm layers altogether? For smaller datasets, this helped me.

Supposedly other normalization layers, like the group norm, fix this problem.
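As a minimal sketch of that idea (layer sizes here are illustrative, not from any model in this thread): GroupNorm normalizes over channel groups within each sample, keeps no running statistics, and therefore behaves identically in train and eval mode.

```python
import torch
import torch.nn as nn

# GroupNorm in place of BatchNorm2d: no running_mean / running_var buffers,
# so there is no train/eval discrepancy to begin with.
channels = 64
block = nn.Sequential(
    nn.Conv2d(3, channels, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=8, num_channels=channels),  # instead of nn.BatchNorm2d(channels)
    nn.ReLU(),
)
```

Unlike BatchNorm, this block produces the same output for a given input whether block.train() or block.eval() is set.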


I was training a model containing batch norms, and also saw degraded performance when using model.eval().

I tried:

  • Change the momentum term in the BatchNorm constructor to a higher value.
  • Before you set model.eval(), run a few inputs through the model (just the forward pass; you don’t need to backward). This helps stabilize the running_mean / running_var values.
  • Increase the batch size.
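The warm-up suggestion in the second bullet can be sketched like this (the toy model and random inputs are placeholders for a real model and training batches):

```python
import torch
import torch.nn as nn

# Forward a few batches in train mode, without gradients, so BatchNorm's
# running_mean / running_var settle before switching to eval().
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

model.train()
with torch.no_grad():
    for _ in range(20):                       # a handful of warm-up batches
        model(torch.randn(32, 3, 16, 16))     # stand-in for real training data
model.eval()                                   # running stats are now warmed up
```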

Nothing helped.

Using GroupNorm actually fixed it, but I think BatchNorm is still the superior normalization so I wanted to use that.

In the end I saw that I was indeed using the same BatchNorm layers in different parts of the network. Once I changed that, it worked again.
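For anyone unsure what this re-use mistake looks like, here is a hypothetical illustration (module names and sizes are made up): one BatchNorm object used at two depths mixes the statistics of two different activation distributions into a single running_mean / running_var, which corrupts eval-mode behavior.

```python
import torch.nn as nn

# Bug: the SAME BatchNorm module object appears twice, so one set of
# running statistics accumulates two different activation distributions.
shared_bn = nn.BatchNorm2d(16)
buggy = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), shared_bn, nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), shared_bn, nn.ReLU(),  # same object again!
)

# Fix: give each position its own BatchNorm instance.
fixed = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
)
```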

Hope this helps for other people!


Had the same problem; this fixed it for me. Thanks @mohsen!

It solves my problem. Is this a bug in version 0.4.1? Because it means model.eval() cannot disable the batchnorm running mean?

Hey Simon,

I’m seeing this exact same issue, and I’m using PyTorch 1.0rc, so I’m thinking nothing is bugged. My model is the PyTorch densenet implementation found at: In the model the BN layers are generated in a loop, but they do have overlapping names (norm1, norm2), at line 22 and line 26 respectively. I’m not as familiar with generating BN layers like this; could this indicate the “re-use” issue you are talking about?

I am also using small batch sizes (because my inputs are huge), so an unstable running average estimate is another possibility.



Are you using the exact same implementation as densenet? If so, “reusing” is not a problem because each norm1 is assigned to a different module object.


Thanks @mohsen, setting track_running_stats to False solved my problem.