Batch size reduces testing set accuracy - pre trained CNNs


I’m using some pre-trained CNNs (ResNets and VGGs) and trying to ensemble their classifications by averaging the softmax output vectors, evaluated only on the test set of CIFAR10.

When I iterate over the test set with batch size = 128/256, the accuracy is around 92%.
When I iterate over the test set with batch size = 1, the accuracy is around 12%!

Again, I’m using pre-trained CNNs, input is the test set of CIFAR10.

What could be the problem?


It’s hard to infer what the issue is with just a description of the batch sizes; could you post a code snippet showing your data loading pipeline for the test set?

Looking at popular reference implementations (e.g., kuangliu/pytorch-cifar: 95.47% on CIFAR10 with PyTorch) might also help.

import torch

from tqdm import tqdm

from torchvision import datasets, transforms, models

from torch.utils.data import DataLoader


model_names = [





        # "cifar10_resnet56",


batch_size = 1

test_transform = transforms.Compose([



def load_models():

    models = []

    for model_name in model_names:

        model = torch.hub.load("chenyaofo/pytorch-cifar-models", model_name, pretrained=True)

        models.append(model)

    return models

testset = datasets.CIFAR10(root='./data', train=False,

                                       download=True, transform=test_transform)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,

                                         shuffle=False, num_workers=1)

models = load_models()

model = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet20", pretrained=True)

import torch.nn as nn
import torch

class MyEnsemble(nn.Module):

  def __init__(self, modelA, modelB, modelC, modelD):
    super().__init__()  # required before assigning submodules
    self.modelA = modelA
    self.modelB = modelB
    self.modelC = modelC
    self.modelD = modelD
    # self.modelE = modelE

  def forward(self, x):
    out1 = self.modelA(x)
    out2 = self.modelB(x)
    out3 = self.modelC(x)
    out4 = self.modelD(x)
    # out5 = self.modelE(x)

    out1 = torch.softmax(out1, dim=1)
    out2 = torch.softmax(out2, dim=1)
    out3 = torch.softmax(out3, dim=1)
    out4 = torch.softmax(out4, dim=1)
    # out5 = torch.softmax(out5, dim=1)

    out = out1 + out2 + out3 + out4

    # out = out / 2

    return out

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = MyEnsemble(models[0], models[1], models[2], models[3]).to(device)

total = 0
correct = 0
with torch.no_grad():
    for images, labels in tqdm(testloader):
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predictions = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predictions == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

Edit: I misread testing for training. My bad.

Old answer

Be aware that a bigger batch size is generally better for training (as long as you have enough computing power), because more samples are considered when calculating each gradient.

If you choose a batch size of 1, you would be optimizing your network at each step for only the image it sees. If you instead show it a larger number of images in one step, the gradient is calculated so that it decreases the error across all of the shown images.
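
To make the point above concrete, here is a minimal sketch in plain Python. It assumes a hypothetical one-parameter model `w * x` with a squared-error loss (neither is from the thread) and shows that the mini-batch gradient is just the mean of the per-sample gradients, so one batched step balances all samples instead of following a single one.

```python
# Hypothetical toy model: prediction = w * x, loss = (w * x - y) ** 2

def per_sample_grad(w, x, y):
    # Derivative of (w * x - y) ** 2 with respect to w
    return 2 * (w * x - y) * x

def batch_grad(w, xs, ys):
    # Mini-batch gradient: the mean of the per-sample gradients
    grads = [per_sample_grad(w, x, y) for x, y in zip(xs, ys)]
    return sum(grads) / len(grads)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0

# With batch size 1, each step follows one of these (very different) gradients
single = [per_sample_grad(w, x, y) for x, y in zip(xs, ys)]
# With batch size 4, one step follows their average
full = batch_grad(w, xs, ys)

print("per-sample gradients:", single)
print("batch gradient:", full)
```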

Theoretically, you should use the whole dataset as a batch so that each gradient update reflects the whole dataset. But this is not always possible due to constraints on the resources you have available. This is why we use batch training (sometimes referred to as mini-batch training).

Finally, keep in mind that some types of normalization (e.g. Batch Normalization) need at least two samples per batch to learn.
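
A small sketch of that last point, assuming a standalone `nn.BatchNorm1d` layer as a stand-in (it is not one of the models from the thread). In training mode PyTorch refuses a single sample here because it cannot compute a batch variance from one value per channel; in eval mode the same input passes, since the stored running statistics are used instead.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)  # stand-in layer, 4 features

# Training mode: statistics come from the current batch,
# so a single sample gives nothing to compute a variance from.
bn.train()
try:
    bn(torch.randn(1, 4))
    raised = False
except ValueError as err:
    raised = True
    print("train mode, batch of 1:", err)

# Eval mode: running statistics are used, so one sample is fine.
bn.eval()
out = bn(torch.randn(1, 4))
print("eval mode, batch of 1:", tuple(out.shape))
```

Note that convolutional feature maps with more than one spatial position per channel do not trigger this error, which is one way a batch size of 1 can run silently in training mode and still give poor results.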

Well, you’re right - BUT, I don’t train my CNNs…
I use pretrained ones… so basically I only use feedforward, without gradient descent…

So when I use a batch size of 1, I get a MUCH longer runtime and MUCH lower accuracy on the test set, compared to batch sizes of 32/128/256.

I want to emphasize again - I don’t train, I only use pre-trained CNNs.

This drop in accuracy is because your models use Batch Normalization. Batch Normalization has been shown to work badly when the batch size is small. From this SO answer:

unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics “averaged” over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don’t work as intended.
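
One thing worth checking alongside this: BatchNorm layers only compute statistics from the current batch while the module is in training mode. In eval mode they use the running statistics saved with the pre-trained weights, so the batch size should not change the predictions. As far as the posted code shows, `model.eval()` is never called, so if the hub models come back in training mode (the `nn.Module` default, which is an assumption here), calling `model.eval()` before the test loop is likely the fix. Below is a minimal sketch; the small network and the random images are hypothetical stand-ins for the pre-trained CNNs and CIFAR10.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained CNN with a BatchNorm layer
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

# Fake "test images" with different per-image scales
scales = torch.linspace(0.5, 2.0, 16).view(-1, 1, 1, 1)
images = torch.randn(16, 3, 32, 32) * scales

def predict(model, images, batch_size):
    # Run the forward pass in chunks of batch_size, without gradients
    outputs = []
    with torch.no_grad():
        for start in range(0, images.size(0), batch_size):
            outputs.append(model(images[start:start + batch_size]))
    return torch.cat(outputs)

# Training mode: BatchNorm uses the statistics of each batch
net.train()
train_gap = (predict(net, images, 16) - predict(net, images, 1)).abs().max().item()

# Eval mode: BatchNorm uses its stored running statistics
net.eval()
eval_gap = (predict(net, images, 16) - predict(net, images, 1)).abs().max().item()

print("max output difference, train mode:", train_gap)
print("max output difference, eval mode:", eval_gap)
```

In the ensemble from the question, the equivalent change would be a single `model.eval()` call on the `MyEnsemble` instance before iterating over `testloader`, since `eval()` propagates to all submodules.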