Different batch sizes give different outputs in model.eval() mode

Hello,
I could not find a solution anywhere, so please help me with this problem.

I trained my model with a batch size of 32 (on 3 GPUs).
The model contains BatchNorm1d layers (plus some dropouts).

During testing, I set

model.eval()
track_running_stats = False

When I load a sample of test data x and run it through the model with model(x), the result is completely different from the outputs seen during training.
For example, the model outputs range over 0~0.99 during training with a batch size of 32, but only over 0~0.05 with a batch size of 1 during testing.

To examine this further, I loaded two or more test samples from the dataloader and ran them through the model with different batch sizes, and I got different outputs even though the data were the same.

For example, I loaded 4 items (x) from the loader,

x1 = x[:2]  # batch size 2
x2 = x[2:]  # batch size 2

x11 = torch.cat([x[0:1], x[0:1]], dim=0)  # batch size 2, same sample twice

y0 = model(x[0:1])  # batch size 1
y1 = model(x1)      # batch size 2
y2 = model(x2)      # batch size 2
y11 = model(x11)    # batch size 2, same sample twice
y = model(x)        # batch size 4

print(torch.allclose(y1, y[:2])) # False. y[:2] is different from y1
print(torch.allclose(y2, y[2:])) # False. y[2:] is also different from y2

print(torch.allclose(y11[0], y11[1])) # True
print(torch.allclose(y1[0], y11[0])) # False. y11[0] outputs the same as y11[1], but different from y1[0]
print(torch.allclose(y1[0], y0[0]), torch.allclose(y11[0], y0[0])) # False, False. y0 is also different from y11[0] or y1[0]


The problems are:

  1. The model produces different values depending on the batch size during testing.
  • y[:2] is different from y1, y[2:] is different from y2, and y0 is different from both y11[0] and y1[0].
  • In particular, when the batch size is 1 (the y0 case), the output histogram ranges over 0~0.05 (which is not intended), while a batch size of 2 or more with different items gives 0~0.99 (as intended during training).

  2. The model produces the same value when the batch size is increased manually by repeating the same data: y11[0] == y11[1] returns True. This seems correct, but the histogram still ranges over 0~0.05.

I think the problems are due to the BatchNorm1d layers.
Can someone help me or give a hint on how to solve them?

I guess your model is somehow doing something wrong with the batch dimension.
You can narrow the problem down by running a simpler test on just the suspicious parts of your model:

import torch
import torch.nn as nn

def test(model, in_shape):
    # Compare the output of a full batch with the outputs of the two half batches.
    batch_size = in_shape[0]
    input_all = torch.randn(*in_shape)  # the input does not have to be real test data
    input_1st_half = input_all[:batch_size//2]
    input_2nd_half = input_all[batch_size//2:]

    output_all = model(input_all)

    output_1st_half = model(input_1st_half)
    output_2nd_half = model(input_2nd_half)

    output_concat = torch.cat([output_1st_half, output_2nd_half], dim=0)
    return torch.allclose(output_all, output_concat)

model = nn.BatchNorm1d(10)  # the model does not need to be trained
model.eval()

print(test(model, [16, 10, 31]))  # True here: eval-mode BN with running stats does not depend on the batch
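
For instance (just a sketch with arbitrary shapes, reusing the test function above), a BatchNorm1d that uses batch statistics instead of running statistics should fail the same check:

bn_batch_stats = nn.BatchNorm1d(10, track_running_stats=False)  # no running stats, so batch statistics are used even in eval()
bn_batch_stats.eval()
print(test(bn_batch_stats, [16, 10, 31]))  # False: the full batch and the half batches are normalised with different statistics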

Why are you setting track_running_stats = False?
Based on the docs, it’ll use the batch statistics instead of the running stats, which explains the difference:

track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: True

Small example:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.eval()

x = torch.randn(2, 3, 24, 24) + 100
out = bn(x)
print((out - x).abs().max())
> tensor(0.0005, grad_fn=<MaxBackward1>)

bn.track_running_stats = False
out = bn(x)
print((out - x).abs().max())
> tensor(100.1011, grad_fn=<MaxBackward1>)

Thank you for the advice, SunQpark.

I removed all of the BatchNorm1d layers from my model and am training it again.
The problems are gone now.
I just wonder what was wrong with using BatchNorm1d…

Thank you for the comments, ptrblck!
I think I found the reason for the problems above.

Originally I was using BatchNorm1d as in the first code below.
With the first code, I had to set track_running_stats = False to get reasonable outputs.
Otherwise, the outputs were always 0.

The shape of the initial input is (N,400,3): length 400, 3 channels.
After the nn.Linear layer, it becomes (N,400,32).
Then I wanted to pass it through nn.BatchNorm1d.
I initially thought that the batch norm should be done over the 3rd dimension, but since BatchNorm1d takes inputs of shape (N,C,L), I transposed the 2nd and 3rd dimensions as in the code below.

# First code. I had to set track_running_stats = False to avoid all-zero outputs.
import torch

class someModel(torch.nn.Module):
    def __init__(self):
        super(someModel, self).__init__()
        self.input_size = 3
        self.dims = 32
        self.relu = torch.nn.ReLU()
        self.fc1 = torch.nn.Linear(self.input_size, self.dims)
        self.bn1 = torch.nn.BatchNorm1d(self.dims)

    def forward(self, x):
        # x shape: (N, 400, 3)
        # transpose so the 32 features become the BatchNorm1d channels: (N, 32, 400)
        x = self.relu(self.bn1(self.fc1(x).transpose(1, 2))).transpose(1, 2)
        return x  # (N, 400, 32)

After reading your comment, I removed the transpose operation and applied batch norm over the 2nd dimension (400) rather than the 3rd dimension (32), as in the code below.

# Second code. I do not have to set track_running_stats = False anymore.
class someModel_2(torch.nn.Module):
    def __init__(self):
        super(someModel_2, self).__init__()
        self.input_size = 3
        self.dims = 32
        self.seq_len = 400  # CHANGED HERE: batch norm now runs over the length dimension (400)
        self.relu = torch.nn.ReLU()
        self.fc1 = torch.nn.Linear(self.input_size, self.dims)
        self.bn1 = torch.nn.BatchNorm1d(self.seq_len)

    def forward(self, x):
        # x shape: (N, 400, 3); after fc1: (N, 400, 32)
        x = self.relu(self.bn1(self.fc1(x)))  # removed the transpose; dim 1 (400) is treated as channels
        return x  # (N, 400, 32)

The problems are now gone, so it looks like the second way is right.
Also, I no longer need to set track_running_stats = False.
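
As a quick sanity check (just a rough sketch reusing someModel_2 from above), the per-sample outputs no longer depend on the batch size in eval mode, since BatchNorm1d now uses its running statistics:

model = someModel_2()
model.eval()

x = torch.randn(4, 400, 3)
with torch.no_grad():
    # the first two samples are normalised the same way whether they are
    # processed alone or together with the rest of the batch
    print(torch.allclose(model(x)[:2], model(x[:2])))  # True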

I have one last question though.
I am not familiar with these 3D tensors, (N,C,L) or (N,L,C)…

Why should batch norm be done over the 2nd dimension of size 400 (L), as in the second way?
Should it not be done over the 3rd dimension of size 32 (C), since the tensor becomes (N,400,32) after the nn.Linear layer?
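
To make the question concrete, here is a rough sketch of the two options I am comparing (the shapes and layer sizes are just for illustration):

import torch

x = torch.randn(8, 400, 3)            # (N, L=400, C_in=3)
h = torch.nn.Linear(3, 32)(x)         # (N, 400, 32) after the linear layer

# Option A (first code): treat the 32 features as channels.
# BatchNorm1d(32) computes mean/var per feature, over N and the 400 positions.
bn_a = torch.nn.BatchNorm1d(32)
out_a = bn_a(h.transpose(1, 2)).transpose(1, 2)   # (N, 400, 32)

# Option B (second code): treat the 400 positions as channels.
# BatchNorm1d(400) computes mean/var per position, over N and the 32 features.
bn_b = torch.nn.BatchNorm1d(400)
out_b = bn_b(h)                                   # (N, 400, 32)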

Can you or someone explain this in a simple way?

Thank you once again for fixing my problem, ptrblck!