Different batch sizes give different test accuracies

I think it’s because of the behavior of the BatchNorm layer.

Also by default, during training this layer keeps running estimates of its computed mean and variance, which are then used for normalization during evaluation. The running estimates are kept with a default momentum of 0.1.

When you change the batch size at evaluation, the mean and variance also change, so you get different results.

You can keep the evaluation batch size the same as in training, and split or concatenate the results into the shape you want.
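
For illustration, here is a minimal standalone sketch (not from the original posts) of this behaviour: in train() mode a BatchNorm layer normalizes with the statistics of the current batch, so a sample's output depends on the rest of the batch, while in eval() mode it uses the running estimates and the batch size no longer matters.

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)   # running estimates start at mean=0, var=1; momentum=0.1
x = torch.randn(8, 4)

# train() mode: normalization uses the current batch statistics,
# so the first four rows change when the rest of the batch changes.
bn.train()
out_full = bn(x)
out_half = bn(x[:4])
print(torch.allclose(out_full[:4], out_half))    # typically False

# eval() mode: normalization uses the stored running estimates,
# so each sample is normalized independently of the batch size.
bn.eval()
out_full = bn(x)
out_single = bn(x[:1])
print(torch.allclose(out_full[:1], out_single))  # True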

Hi, @ptrblck

In [43]: model.eval()
Out[43]:
localatt(
  (fc1): Linear(in_features=13, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=512, bias=True)
  (do2): Dropout(p=0.5)
  (blstm): LSTM(512, 128, batch_first=True, dropout=0.5, bidirectional=True)
  (fc3): Linear(in_features=256, out_features=2, bias=True)
)
x = torch.randn(10, 120, 13)
output_all = model(x, torch.tensor([120]*10))
output_1 = model(x[:5], torch.tensor([120]*5))
output_2 = model(x[5:], torch.tensor([120]*5))
output_stacked = torch.cat((output_1, output_2), dim=0)
print(torch.allclose(output_all, output_stacked))

I got the result False. The results of each output are below:

In [51]: output_all
Out[51]:
tensor([[0.0013, 0.9987],
        [0.0016, 0.9984],
        [0.0019, 0.9981],
        [0.0014, 0.9986],
        [0.0013, 0.9987],
        [0.0016, 0.9984],
        [0.0012, 0.9988],
        [0.0014, 0.9986],
        [0.0015, 0.9985],
        [0.0017, 0.9983]], grad_fn=<SoftmaxBackward>)

In [52]: output_1
Out[52]:
tensor([[0.0286, 0.9714],
        [0.0316, 0.9684],
        [0.0338, 0.9662],
        [0.0296, 0.9704],
        [0.0281, 0.9719]], grad_fn=<SoftmaxBackward>)

In [53]: output_2
Out[53]:
tensor([[0.0316, 0.9684],
        [0.0278, 0.9722],
        [0.0297, 0.9703],
        [0.0304, 0.9696],
        [0.0323, 0.9677]], grad_fn=<SoftmaxBackward>)

In [54]: output_stacked
Out[54]:
tensor([[0.0286, 0.9714],
        [0.0316, 0.9684],
        [0.0338, 0.9662],
        [0.0296, 0.9704],
        [0.0281, 0.9719],
        [0.0316, 0.9684],
        [0.0278, 0.9722],
        [0.0297, 0.9703],
        [0.0304, 0.9696],
        [0.0323, 0.9677]], grad_fn=<CatBackward>)

My model definition is below.

import torch as tc
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# init_linear is a custom weight-initialization helper defined elsewhere.

class localatt(nn.Module):
    def __init__(self, featdim, nhid, ncell, nout):
        super(localatt, self).__init__()

        self.featdim = featdim
        self.nhid = nhid
        self.fc1 = nn.Linear(featdim, nhid)
        self.fc2 = nn.Linear(nhid, nhid)
        self.do2 = nn.Dropout()

        self.blstm = tc.nn.LSTM(nhid, ncell, 1,
                batch_first=True,
                dropout=0.5,
                bias=True,
                bidirectional=True)

        # learnable attention vector over the BLSTM outputs
        self.u = nn.Parameter(tc.zeros((ncell*2,)))

        self.fc3 = nn.Linear(ncell*2, nout)

        self.apply(init_linear)

    def forward(self, inputs, lens):

        batch_size = inputs.size()[0]

        # flatten to (batch * seq_len, featdim) for the frame-wise linear layers
        indep_feats = inputs.view(-1, self.featdim)

        indep_feats = F.relu(self.fc1(indep_feats))

        indep_feats = F.relu(self.do2(self.fc2(indep_feats)))

        # restore the (batch, seq_len, nhid) shape for the BLSTM
        batched_feats = indep_feats.view(batch_size, -1, self.nhid)

        packed = pack_padded_sequence(batched_feats, lens, batch_first=True)

        output, hn = self.blstm(packed)

        padded, lens = pad_packed_sequence(output, batch_first=True, padding_value=0.0)

        # attention weights computed from the learned vector u
        alpha = F.softmax(tc.matmul(padded, self.u))

        return F.softmax(self.fc3(tc.sum(tc.matmul(alpha, padded), dim=1)))

Are there any problems in my model? Thanks very much!

Based on your code, it looks like tmp = torch.matmul(alpha, padded) will create differently shaped outputs for the whole and sliced inputs.
I’ve reduced the seq_len of your input for easy debugging:

x = torch.randn(10, 12, 13)

Passing x completely will create an intermediate activation tmp of torch.Size([10, 10, 256]), while the sliced inputs will create tensors of torch.Size([5, 5, 256]).
The following torch.sum(tmp, dim=1) operator will thus yield different results.
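
To make the shape issue concrete, here is a small standalone sketch (shapes chosen to match the reduced example above; not part of the original reply). When alpha is 2-dimensional, torch.matmul broadcasts it as a full matrix against the 3-dimensional padded tensor, so the batch size leaks into the matrix dimensions:

import torch

ncell = 128  # bidirectional LSTM, so the feature dimension is 2 * 128 = 256

def attention_shape(batch_size, seq_len):
    alpha = torch.randn(batch_size, seq_len)               # attention weights [batch, seq]
    padded = torch.randn(batch_size, seq_len, 2 * ncell)   # BLSTM outputs [batch, seq, 256]
    # The 2-D alpha is broadcast against every batch element of padded,
    # producing [batch, batch, 256] instead of a per-sample weighted sum.
    return torch.matmul(alpha, padded).shape

print(attention_shape(10, 12))  # torch.Size([10, 10, 256])
print(attention_shape(5, 12))   # torch.Size([5, 5, 256])

A batch-size-independent alternative would be a per-sample weighted sum, e.g. torch.bmm(alpha.unsqueeze(1), padded).squeeze(1) or (alpha.unsqueeze(-1) * padded).sum(dim=1), though that changes the model and is not part of the original reply.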


Thank you so much. My problem is solved now.

Hi @ptrblck, @Annus_Zulfiqar and all,

I am having a similar problem: I train an image classifier (7 classes) which basically has a few layers of Conv2d -> ReLU -> BatchNorm using strided convolutions, then a couple of linear layers, with CrossEntropyLoss as the criterion.

I train with a batch size of 50. I want to use this for real-time classification from a stream of images (video), so at inference time I always classify one image at a time.

As a result I want to compute accuracy at test time with a batch size of 1, but when I do so accuracy drops massively. When I use a batch size of 30 or more, accuracy seems consistently good.

I am testing within a with torch.no_grad(): block and calling net.eval() too.

How is it possible that the batch size influences accuracy at test time? Shouldn't the mean and variance for BN be taken from what was learned during training? Any advice?

Thanks!

Alberto

The batch size should not change the accuracy during evaluation/testing, if model.eval() was called.
Do you have a reproducible code snippet so that we can have a look?
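
As a quick sanity check, here is a hedged standalone sketch (not from the original reply; the small model is just a stand-in for the Conv2d -> ReLU -> BatchNorm classifier described above) that verifies per-sample outputs in eval() mode do not depend on the batch size:

import torch
import torch.nn as nn

# Small stand-in model with the same ingredients as the classifier above.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(16),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(32),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 7),
)
model.eval()

x = torch.randn(10, 1, 64, 64)
with torch.no_grad():
    out_batched = model(x)                                                 # batch size 10
    out_single = torch.cat([model(x[i:i + 1]) for i in range(10)], dim=0)  # batch size 1

# In eval() mode BatchNorm uses its running statistics, so the per-sample
# outputs should agree up to floating-point noise.
print(torch.allclose(out_batched, out_single, atol=1e-5))   # expected: True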


Hi,
thanks for your quick answer. I have created a minimal example in a notebook here:
https://colab.research.google.com/drive/1ZO6NawopCnndE8Q3yQhT7i_fAzH4qNqe

In summary, I run the model like this:

import numpy as np
import torch

image_size = np.array((256, 256))
N_CLASSES = 7
data100 = torch.rand(100, 1, image_size[0], image_size[1])
net = NetAG(image_size, N_CLASSES)  # NetAG is defined in the linked notebook
net.eval()

BATCH_SIZE = 100
with torch.no_grad():
  x_in = data100[:BATCH_SIZE,:,:,:].type(torch.FloatTensor)
  out100, _ = net(x_in)

BATCH_SIZE = 1
with torch.no_grad():
  x_in = data100[:BATCH_SIZE,:,:,:].type(torch.FloatTensor)
  out10, _ = net(x_in)

print(out100[0])
print(out10[0])

The result is:

tensor([-0.0814,  0.1082, -0.0612, -0.0544, -0.1154,  0.0559, -0.0581])
tensor([-0.0812,  0.1085, -0.0614, -0.0544, -0.1154,  0.0558, -0.0578])

They are “similar” but not the same. However, if the batch size is the same in both cases, the results are identical.
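
To see how far apart the two outputs really are, it can help to quantify the difference instead of comparing the printed values by eye (a small sketch reusing out100 and out10 from the snippet above):

# Largest per-element deviation between the batched and single-sample outputs.
print((out100[0] - out10[0]).abs().max())

# A strict comparison fails; a gap on the order of 1e-4 is usually too large
# for pure floating-point noise and hints at a batch-dependent operation in
# the model (which turned out to be the case here, see below).
print(torch.allclose(out100[0], out10[0], atol=1e-6))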

This behaviour worsens further when using the trained model on real data.

Any advice?
Thanks!

Dear all,
I actually found the bug in my model: I was normalising data over batches somewhere else. As @ptrblck said, if you use model.eval(), then the batch size has no influence at test time.
The link above has the solution and minimal example for others' reference.
Many thanks!
Alberto


For a regression problem, if different errors (MAE or relative error) are obtained for different batch sizes using the same model and the same test dataset, please check whether torch.squeeze() is applied to the output of the model (i.e., the output of the forward function). The squeeze can change the output's shape in a batch-size-dependent way (for example, it also drops the batch dimension when the batch size is 1), so the error metric ends up being computed over differently broadcast shapes.

def forward(self, h_nodes, data):
    output = self.processing(h_nodes, data)
    # return torch.squeeze(output)  # <- the squeeze that made the error depend on the batch size
    return output
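
A minimal illustration (made-up shapes, not from the original post) of one way this goes wrong: after squeezing, the output no longer lines up with [batch, 1]-shaped targets, and broadcasting changes what the metric actually averages, in a way that depends on the batch size.

import torch

torch.manual_seed(0)
targets = torch.randn(4, 1)   # regression targets, shape [batch, 1]
outputs = torch.randn(4, 1)   # model outputs before squeezing

# Full-batch evaluation: squeeze() turns [4, 1] into [4]; subtracting the
# [4, 1] targets broadcasts to a [4, 4] matrix, so the "MAE" is averaged over
# all sample pairs instead of matched samples.
mae_batched = (outputs.squeeze() - targets).abs().mean()

# Batch-size-1 evaluation: squeeze() turns [1, 1] into a scalar, the shapes
# line up differently, and the per-sample errors are the intended ones.
mae_single = torch.stack([
    (outputs[i:i + 1].squeeze() - targets[i:i + 1]).abs().mean()
    for i in range(4)
]).mean()

print(mae_batched.item(), mae_single.item())  # generally different values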

I think I encountered the same issue as you, and I performed a simple experiment suggesting that it may be a problem with the GPU card.