ResNet50 gives different outputs depending on batch size

Hi, during some sanity checking I discovered that torchvision.models.resnet50 (and probably other models as well) gives different results when passing in a batch of data versus passing one input at a time. I have made sure the model is in evaluation mode by calling model.eval().

My question
Why does a batched forward pass yield different results than feeding the inputs one at a time?

Observations and thoughts
I initially suspected the BatchNorm2d layers to be the culprit (I had experienced some issues with batch norm in TensorFlow), but after testing on a small model I got consistent results even with BatchNorm2d layers. The code below shows that the absolute differences between the batched and "one by one" outputs are very small: the 99th percentile is 3.81e-06, meaning that 99 percent of the absolute differences are below 3.81e-06. Keep in mind that these errors scale with the magnitude of the input.
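For reference, here is a minimal sketch of the kind of small-model check I mean (the toy architecture, layer sizes and input shape are arbitrary, not the exact model I tested):

import torch
import torch.nn as nn

# Tiny model with a BatchNorm2d layer. In eval mode BatchNorm2d normalizes
# with fixed running statistics per channel, so it should not depend on the
# other samples in the batch.
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Conv2d(8, 4, kernel_size=3, padding=1),
)
toy.eval()

X = torch.randn(4, 3, 32, 32)

with torch.no_grad():
    out_batch = toy(X)
    out_singles = torch.cat([toy(x.unsqueeze(0)) for x in X])

# For a small model like this I got matching outputs for batched and
# one-by-one inputs.
print("allclose:", torch.allclose(out_batch, out_singles))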

I now suspect the differences stem from out-of-order execution, resulting in different accumulation of floating-point errors across the forward passes. I have sanity checked that there is no randomness involved by forward passing the batched and single (one by one) inputs twice each and checking that the repeated runs are equal.
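As a quick illustration of what I mean by accumulation order (a toy example with arbitrary values, unrelated to the ResNet run): floating-point addition is not associative, so summing the same values in a different order can give slightly different results.

import torch

# The same values summed in two different orders: all at once versus in
# chunks. Depending on how the reduction accumulates, the two results
# may differ in the last bits.
x = torch.randn(10000, dtype=torch.float32)

total_direct = x.sum()
total_chunked = x.view(100, 100).sum(dim=1).sum()

print(total_direct.item(), total_chunked.item())
print("bitwise equal:", torch.equal(total_direct, total_chunked))  # often False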

Code

import torch
import random
import os
import numpy as np
from torchvision.models import resnet50

# Seed a lot of things
seed = 42069
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)

model = resnet50(pretrained=True)

# Batch of 4 RGB (3 channel) images of size 400 x 400, (N, C, H, W)
X = torch.randn((4, 3, 400, 400), dtype=torch.float32) * 10 # channels first

with torch.no_grad():
    model.eval()
    # Feed forward batch of 4
    output_batch1 = model(X)  # Run twice to sanity check consistency
    output_batch2 = model(X) 
    # Feed forward one at a time then concatenate
    output_singles1 = torch.cat([model(x.unsqueeze(0)) for x in X]) 
    output_singles2 = torch.cat([model(x.unsqueeze(0)) for x in X]) 
    # Outputs are of shape (N, 1000)

print(f"{' output_batch1 vs output_batch2 ':=^50}")
print("allclose:", torch.allclose(output_batch1, output_batch2))
print()

print(f"{' output_singles1 vs output_singles2 ':=^50}")
print("allclose:", torch.allclose(output_singles1, output_singles2))
print()

print(f"{' batch vs singles ':=^50}")
print("allclose:", torch.allclose(output_batch1, output_singles1))
# Calculate percentiles
absdiff = np.sort(np.abs((output_batch1 - output_singles1).numpy()).flatten())
print("Percentiles of abs(output_batch1 - output_singles1):")
for p in (85, 90, 95, 99):
    print(f"\t{p}th percentile:", absdiff[int(len(absdiff) * p/100)])

stdout

========= output_batch1 vs output_batch2 =========
allclose: True

======= output_singles1 vs output_singles2 =======
allclose: True

================ batch vs singles ================
allclose: False
Percentiles of abs(output_batch1 - output_singles1):
        85th percentile: 1.5497208e-06
        90th percentile: 1.9073486e-06
        95th percentile: 1.9073486e-06
        99th percentile: 3.8146973e-06

Version information:
Python 3.8.3:

  • torch==1.7.0
  • torchvision==0.8.0
  • numpy==1.19.2

Ubuntu 18.04.5 LTS (Bionic Beaver)

  • Running the program inside a Docker container built FROM pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime as the parent image.

Thanks for the help!

Related topic:
A similar phenomenon was reported there (accuracies varying with batch size), but without an explanation.

Changing the batch size may change how computation is organized, and deviations within numerical accuracy are expected.
For things like convolutions you might even have specialized “inference” kernels for batch size one, but you can even see this with much simpler operations:
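For example (a sketch with arbitrary sizes, just to show the idea), a single matrix multiply applied to a whole batch versus row by row can already disagree in the last bits:

import torch

torch.manual_seed(0)

W = torch.randn(256, 256)
X = torch.randn(8, 256)

batched = X @ W                               # one matmul over the batch
one_by_one = torch.stack([x @ W for x in X])  # eight matrix-vector products

print("max abs diff:", (batched - one_by_one).abs().max().item())
print("bitwise equal:", torch.equal(batched, one_by_one))  # may be False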

If you want CPU computation you could quantize your model and see if this eliminates the effect. I would expect quantized computation to be more stable as it is essentially fixed point instead of floating point.
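As a rough sketch of how you could check this with the quantized ResNet50 that ships with torchvision (torchvision.models.quantization.resnet50; CPU only, and make sure your torchvision version provides the pretrained quantized weights):

import torch
from torchvision.models.quantization import resnet50 as quantized_resnet50

# Pretrained, statically quantized (int8) ResNet50 for CPU inference.
qmodel = quantized_resnet50(pretrained=True, quantize=True)
qmodel.eval()

X = torch.randn(4, 3, 400, 400) * 10

with torch.no_grad():
    out_batch = qmodel(X)
    out_singles = torch.cat([qmodel(x.unsqueeze(0)) for x in X])

# With fixed-point arithmetic the batched and one-by-one outputs may now
# match exactly; compare them the same way as in your script.
print("allclose:", torch.allclose(out_batch, out_singles))
print("bitwise equal:", torch.equal(out_batch, out_singles))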

Best regards

Thomas


Thank you, I just wanted to sanity check things. As I said, I suspected out of order execution (which, as I understand it, is the "changing the batch size may change how computation is organized" point you mention). I just wanted confirmation.

Thanks for the help!