Hi, during some sanity checking I discovered that torchvision.models.resnet50 (and probably other models as well) gives different results when passing in a batch of data versus passing one input at a time. I have made sure the model is in evaluation mode by calling model.eval().
My question is
Why does a batched forward pass yield different results from feeding the inputs “one at a time”?
Observations and thoughts
I initially suspected the BatchNorm2d layers to be the culprit (I had experienced some issues with batch norm in TensorFlow), but after testing on a small model I got consistent results even with BatchNorm2d layers. The code below shows that the absolute differences between the batched and the “one by one” outputs are very small: the 99th percentile is 3.81e-06, meaning that 99 percent of the absolute differences are less than 3.81e-06. Keep in mind that the errors scale with the magnitude of the input.
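To put the 3.81e-06 figure in perspective, here is a minimal sketch of how float32 resolution grows with magnitude. It uses np.spacing, which returns the gap between a value and the next representable one. The magnitudes 16 and 32 are picked for illustration; whether the logits in the run actually fall in that range is an assumption and was not checked.

```python
import numpy as np

# float32 spacing (gap to the next representable value) at a few magnitudes.
# The magnitudes are illustrative, not taken from the model outputs.
for magnitude in (1.0, 16.0, 32.0):
    print(magnitude, np.spacing(np.float32(magnitude)))

ulp_16 = np.spacing(np.float32(16.0))  # spacing for values in [16, 32)
ulp_32 = np.spacing(np.float32(32.0))  # spacing for values in [32, 64)

# 2**-19 and 2**-18 are exactly representable in float32.
print(ulp_16 == np.float32(2.0 ** -19))
print(ulp_32 == np.float32(2.0 ** -18))
```

If the outputs are in the tens, a difference of 3.81e-06 would correspond to a single step of float32 resolution, which is about the smallest disagreement two float32 results can have at that scale.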
I now suspect the differences stem from operations being executed in a different order, resulting in different accumulation of floating-point rounding errors throughout the forward passes. I have sanity checked that there is no randomness involved by running both the batched and the single (one by one) forward passes twice and checking that the results are equal.
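As a small illustration of why the order of operations can matter, the sketch below adds the same three float32 numbers in two different groupings. The values are made up for the demonstration and have nothing to do with the model; it only shows that float32 addition is not associative, not that this is what happens inside the convolution kernels.

```python
import numpy as np

# Illustrative float32 values (not taken from the model run).
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Grouping 1: the two large terms cancel first, then 1 is added.
left_grouping = (a + b) + c

# Grouping 2: 1 is added to -1e8 first. The float32 spacing near 1e8 is 8,
# so the 1 is rounded away before the large terms cancel.
right_grouping = a + (b + c)

print(left_grouping, right_grouping)
print(left_grouping == right_grouping)
```

A kernel that reduces over a different number of elements per thread, or picks a different algorithm for a different batch size, could plausibly group its sums differently in the same way, which would be deterministic per batch size yet differ between batch sizes.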
Code
import torch
import random
import os
import numpy as np
from torchvision.models import resnet50
# Seed a lot of things
seed = 42069
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
model = resnet50(pretrained=True)
# Batch of 4 RGB (3 channel) images of size 400 x 400, (N, C, H, W)
X = torch.randn((4, 3, 400, 400), dtype=torch.float32) * 10 # channels first
with torch.no_grad():
    model.eval()
    # Feed forward batch of 4
    output_batch1 = model(X)  # Run twice to sanity check consistency
    output_batch2 = model(X)
    # Feed forward one at a time then concatenate
    output_singles1 = torch.cat([model(x.unsqueeze(0)) for x in X])
    output_singles2 = torch.cat([model(x.unsqueeze(0)) for x in X])
# Outputs are of shape (N, 1000)
print(f"{' output_batch1 vs output_batch2 ':=^50}")
print("allclose:", torch.allclose(output_batch1, output_batch2))
print()
print(f"{' output_singles1 vs output_singles2 ':=^50}")
print("allclose:", torch.allclose(output_singles1, output_singles2))
print()
print(f"{' batch vs singles ':=^50}")
print("allclose:", torch.allclose(output_batch1, output_singles1))
# Calculate percentiles
absdiff = np.sort(abs(output_batch1 - output_singles1).flatten())
print("Percentiles of abs(output_batch1 - output_singles1):")
for p in (85, 90, 95, 99):
    print(f"\t{p}th percentile:", absdiff[int(len(absdiff) * p/100)])
stdout
========= output_batch1 vs output_batch2 =========
allclose: True
======= output_singles1 vs output_singles2 =======
allclose: True
================ batch vs singles ================
allclose: False
Percentiles of abs(output_batch1 - output_singles1):
	85th percentile: 1.5497208e-06
	90th percentile: 1.9073486e-06
	95th percentile: 1.9073486e-06
	99th percentile: 3.8146973e-06
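On the "allclose: False" result: torch.allclose uses rtol=1e-05 and atol=1e-08 by default and checks |a - b| <= atol + rtol * |b| per element, so a difference of around 2e-06 fails for any element whose magnitude is small. The sketch below reproduces that effect with np.allclose, which has the same defaults. The two arrays are made-up stand-ins for the batched and single outputs, not values from the run.

```python
import numpy as np

# Hypothetical stand-ins for two output vectors (not real model outputs).
batch_like = np.array([20.0, 0.05], dtype=np.float32)
singles_like = batch_like + np.float32(1.9e-6)  # tiny absolute perturbation

# Default tolerances: rtol=1e-05, atol=1e-08.
# The element near 0.05 only tolerates about 5.1e-07, so the check fails.
default_check = np.allclose(batch_like, singles_like)

# A looser absolute tolerance accepts differences of this size.
loose_check = np.allclose(batch_like, singles_like, atol=1e-5)

print("default tolerances:", default_check)
print("atol=1e-5:", loose_check)
```

Passing an explicit atol (for example torch.allclose(output_batch1, output_singles1, atol=1e-5)) would be one way to compare the outputs at a tolerance that matches float32 precision; the value 1e-5 is a judgment call, not a recommended constant.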
Version information:
Python 3.8.3:
- torch==1.7.0
- torchvision==0.8.0
- numpy==1.19.2
Ubuntu 18.04.5 LTS (Bionic Beaver)
- Running program inside Docker container using
FROM pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime
as parent image.
Thanks for the help!
Related topic:
Encountered similar phenomena (varying accuracies respective to batch size), but no explanation