What does model.eval() do for batchnorm layer?

liangstein · September 7, 2017, 3:54pm

Hi Everyone,
When making predictions with a model trained with batchnorm, we should set the model to evaluation mode. My question is: how does evaluation mode affect the batchnorm operation? What does evaluation mode really do for batchnorm operations? Does the model ignore batchnorm?

smth · September 9, 2017, 3:46pm

During training, this layer keeps a running estimate of its computed mean and variance. The running estimate is updated with a default momentum of 0.1.

During evaluation, this running mean/variance is used for normalization.

Reference: http://pytorch.org/docs/master/nn.html#torch.nn.BatchNorm1d
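
A minimal sketch of the behaviour described above, assuming a recent PyTorch. The batch shape and values are made up for illustration, and the manual formulas are my reading of the documented update rule rather than the library's internal code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)              # momentum defaults to 0.1
x = torch.randn(8, 3) * 2.0 + 5.0   # made-up batch: 8 samples, 3 features

# Training mode: normalize with the batch statistics and update the running estimates.
bn.train()
with torch.no_grad():
    bn(x)

# Update rule: running = (1 - momentum) * running + momentum * batch_stat.
# running_mean starts at 0 and running_var at 1; the variance update uses the
# unbiased batch variance.
expected_mean = 0.9 * torch.zeros(3) + 0.1 * x.mean(dim=0)
expected_var = 0.9 * torch.ones(3) + 0.1 * x.var(dim=0, unbiased=True)

# Evaluation mode: normalize with the running estimates, which are no longer updated.
bn.eval()
with torch.no_grad():
    y = bn(x)

# With the default affine parameters (weight=1, bias=0) this is just:
manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
```

Here `y` and `manual` agree, which also shows that running_var holds a variance, since it sits under the square root.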

Joe1 · February 1, 2018, 6:39am

I ran into the same problem when I trained a model using BN layers. If I only need to test one image, will the BN layer affect the result because of the change in batch size?

SimonW · February 1, 2018, 6:58am

When evaluating, you should use eval() mode, and then the batch size doesn’t matter.
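
A quick way to check this yourself, as a sketch with made-up shapes and a short fake "training" loop that only exists to populate the running statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(4)

# Pretend we trained for a few steps so the running estimates are non-trivial.
bn.train()
with torch.no_grad():
    for _ in range(5):
        bn(torch.randn(16, 4, 8, 8) * 3.0 + 1.0)

# In eval mode the same image gives the same output alone or inside a batch.
bn.eval()
batch = torch.randn(16, 4, 8, 8)
with torch.no_grad():
    out_in_batch = bn(batch)[:1]   # first image, processed with 15 others
    out_alone = bn(batch[:1])      # the same image, processed on its own

same_result = torch.allclose(out_in_batch, out_alone, atol=1e-5)
```

`same_result` comes out True because eval mode only reads the stored running statistics and never looks at the other images in the batch.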

Joe1 · February 2, 2018, 9:20am

Thanks~ I have solved the problem~

synchro · February 22, 2018, 3:15am

Hey Soumith,
Maybe a trivial question:

Trained a model with BN on CIFAR10; training accuracy is perfect.
Testing with model.train(True) gets 76% accuracy.
Testing with model.eval() gets only 10%, with 0% in pretty much every category.

Why is this? It should be the opposite, right? @smth

SimonW · February 22, 2018, 3:47am

How did you construct the BN layers?

synchro · February 22, 2018, 1:29pm

Standard way:

nn.BatchNorm2d(64)

Where 64 is the num of output filters of the previous layer.

SimonW · February 22, 2018, 4:48pm

That’s weird. Do you mind sharing your script?

synchro · February 23, 2018, 6:59pm

Hey Simon, sorry for being late.

The definition of the model is as follows; ignore the fact that hyper-params shouldn’t be defined that way (it’s old code). Any idea?

import torch.nn as nn
import torch.nn.functional as F

class Keras_Cifar2(nn.Module):
    def __init__(self, rank1, rank2):
        super(Keras_Cifar2, self).__init__()

        # hyperparams
        self.kern = 3  # for all layers
        self.filt_size1 = 32
        self.filt_size2 = 64
        self.filt_fc1 = 512
        self.num_classes = 10

        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, 3)

        self.pool = nn.MaxPool2d(2, 2)

        self.bn_1 = nn.BatchNorm2d(1)
        self.bn_2 = nn.BatchNorm2d(rank1)
        self.bn_3 = nn.BatchNorm2d(rank2)
        self.bn_4 = nn.BatchNorm2d(self.filt_fc1)
        self.bn_5 = nn.BatchNorm2d(self.num_classes)
        self.bn_6 = nn.BatchNorm2d(32)
        self.bn_7 = nn.BatchNorm2d(64)

        # decomposition
        self.cpdfc1 = nn.Conv2d(64, rank1, 1)
        self.cpdfc2 = nn.Conv2d(rank1, rank1, (6, 1))
        self.cpdfc3 = nn.Conv2d(rank1, rank1, (1, 6))
        self.cpdfc4 = nn.Conv2d(rank1, self.filt_fc1, 1)

        # conv2fc
        #self.conv2fc1 = nn.Conv2d(64, self.filt_fc1, 5)
        self.conv2fc2 = nn.Conv2d(self.filt_fc1, self.num_classes, 1)


    def forward(self, x):

        x = F.relu(self.conv1(x))
        x = self.bn_6(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.bn_6(x)

        x = F.relu(self.conv3(x))
        x = self.bn_7(x)
        x = self.pool(F.relu(self.conv4(x)))
        x = self.bn_7(x)

        x = self.cpdfc1(x)
        x = self.bn_2(x)
        x = self.cpdfc2(x)
        x = self.bn_2(x)
        x = self.cpdfc3(x)
        x = self.bn_2(x)
        x = F.relu(self.cpdfc4(x))
        x = self.bn_4(x)
        x = self.conv2fc2(x)
        x = self.bn_5(x)

        x = x.view(-1, self.num_classes) 
        return x

SimonW · February 23, 2018, 7:13pm

You shouldn’t re-use BN layers. For example, here self.bn_6 sees data from two different layers but accumulates the values into the same running stats buffer. The running stats will then be inaccurate and performance will suffer in eval() mode. Make sure that each BN layer is used at only one place in the network.
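
To illustrate, here is a toy sketch (not the model above). The two `stream_*` helpers are hypothetical stand-ins for the outputs of two different layers, with deliberately exaggerated statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def stream_a():
    # Stand-in for activations of one layer: mean about +10.
    return torch.randn(64, 8, 4, 4) + 10.0

def stream_b():
    # Stand-in for activations of another layer: mean about -10.
    return torch.randn(64, 8, 4, 4) - 10.0

shared = nn.BatchNorm2d(8)   # one BN re-used at two places
bn_a = nn.BatchNorm2d(8)     # one BN per place
bn_b = nn.BatchNorm2d(8)

with torch.no_grad():
    for _ in range(200):
        a, b = stream_a(), stream_b()
        shared(a)
        shared(b)
        bn_a(a)
        bn_b(b)

    # In eval mode the shared layer normalizes stream A with blended statistics.
    shared.eval()
    bn_a.eval()
    shared_out_mean = shared(stream_a()).mean().item()
    separate_out_mean = bn_a(stream_a()).mean().item()

print(shared.running_mean.mean().item())  # a blend, far from both +10 and -10
print(bn_a.running_mean.mean().item())    # close to +10
print(shared_out_mean, separate_out_mean)
```

In train mode both variants normalize correctly because they use per-batch statistics, so the problem only shows up once you switch to eval(): the dedicated layer gives outputs centered near zero, while the shared layer's outputs stay far from zero.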

synchro · February 23, 2018, 7:55pm

Oh, right. When I wrote it I probably thought I was defining just a layer type that I could use as many times as needed. But lol, that’s a single member.

Then a second question is: what’s the best practice when making heavy use of BNs? Write just as many as one needs, or define all of them in a dictionary of BNs?

SimonW · February 23, 2018, 8:07pm

In your case, the network is pretty sequential. So I’d suggest constructing a list of layers in sequential order (F.relu can also be written as the module nn.ReLU) and using the nn.Sequential wrapper.
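
Something along these lines, as a rough sketch of the convolutional part only. `conv_bn_block` is a hypothetical helper I made up for illustration, and the layer sizes are copied from the posted model:

```python
import torch
import torch.nn as nn

def conv_bn_block(in_ch, out_ch, padding, pool):
    """Conv -> ReLU -> optional pool -> its own BatchNorm2d (hypothetical helper)."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=padding), nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    layers.append(nn.BatchNorm2d(out_ch))  # a fresh BN instance for every block
    return layers

features = nn.Sequential(
    *conv_bn_block(3, 32, padding=1, pool=False),
    *conv_bn_block(32, 32, padding=0, pool=True),
    *conv_bn_block(32, 64, padding=1, pool=False),
    *conv_bn_block(64, 64, padding=0, pool=True),
)

x = torch.randn(2, 3, 32, 32)  # CIFAR10-sized dummy input
out = features(x)
print(out.shape)
```

Because every block creates its own nn.BatchNorm2d, each one keeps running statistics for exactly one position in the network.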

synchro · February 23, 2018, 10:41pm

Yup. I didn’t use the Sequential container since the code was taken like that from the tutorial, but it’ll definitely get cleaner with that.

flyingmoth · March 6, 2018, 8:12am

@smth Does the parameter running_var in batch normalization refer to the variance or the standard deviation?

vfmatzkin · August 4, 2018, 2:52am

Thanks, I had the same issue.

Frida · February 9, 2019, 5:35pm

Hi
Can someone help me to understand why applying model.eval is better in the testing phase?
Thanks in advance

lugiavn · February 9, 2019, 6:46pm

Remove your last or even 2nd-to-last BN. BN normalizes features; the last output is class scores and should not be normalized.

Train-mode BN uses stats from the batch. In the test phase this is essentially “cheating” because it accesses other examples in the batch (hence it cannot work if batch size = 1).
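
A small sketch of both points, using BatchNorm1d on made-up (N, C) inputs and assuming a recent PyTorch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)
sample = torch.randn(1, 4)

# Train mode: the same sample gets different outputs depending on its batch mates.
bn.train()
with torch.no_grad():
    out_with_a = bn(torch.cat([sample, torch.randn(7, 4)]))[0]
    out_with_b = bn(torch.cat([sample, torch.randn(7, 4) + 5.0]))[0]
depends_on_batch = not torch.allclose(out_with_a, out_with_b, atol=1e-3)

# Train mode with a single (1, C) sample: there is only one value per channel,
# so batch statistics cannot be computed and PyTorch raises a ValueError.
try:
    bn(sample)
    single_sample_error = None
except ValueError as err:
    single_sample_error = str(err)

# Eval mode: the running statistics are used, so a single sample works.
bn.eval()
with torch.no_grad():
    out_eval = bn(sample)

print(depends_on_batch)
print(single_sample_error)
```

Note that a (1, C, H, W) input to BatchNorm2d does not raise this error in train mode when H*W > 1, because each channel still has several values; the statistics then come from that one image only.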

Frida · February 9, 2019, 6:48pm

I mean, why is it better to use model.eval and take the running statistics and not rely on the current test image statistics?

lugiavn · February 9, 2019, 6:51pm

Because the params are trained on train stats; if the test stats are different, the result might be different.
If you compute test stats, you are basically “training” on the test set, because the stat in this case is a trained param. You can do it, nobody says you can’t, it’s just that people would consider it “cheating”.