Why does the size of a batch affect the features extracted from a pre-trained model in eval mode?

cvenoburi · October 31, 2019, 7:31am

Hi all,
I noticed that using different batch_size values in torch.utils.data.DataLoader gives me slightly different feature vectors (which sometimes changes the model's predictions), even though I use both model.eval() and torch.no_grad() while extracting features.

To reproduce this issue, I created the following test scenario. I extract features for 200 images, where the first 100 images are exactly the same as the last 100 (in the exact same order). I simply use the default pre-trained VGG16 network for this experiment, and I test with batch sizes of 20, 50, and 180.
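A minimal sketch of this setup, for readers who cannot open the notebook. It is not the notebook's code: the small `model` below is a hypothetical stand-in for pre-trained VGG16 (so nothing has to be downloaded), the images are random tensors, and `extract_features` is a helper name made up for illustration. On a given machine the printed differences may well be exactly zero; the point is only the shape of the experiment.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained VGG16 feature extractor.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 16),
)
model.eval()

# 100 random "images", repeated so that image i and image i + 100 are identical.
base = torch.randn(100, 3, 32, 32)
images = torch.cat([base, base], dim=0)
dataset = TensorDataset(images)


def extract_features(batch_size):
    """Run the whole dataset through the model in order and stack the outputs."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    outputs = []
    with torch.no_grad():
        for (batch,) in loader:
            outputs.append(model(batch))
    return torch.cat(outputs, dim=0)


features = {bs: extract_features(bs) for bs in (20, 50, 180)}

# Compare every batch size against batch_size=20, image by image.
for bs in (50, 180):
    diff = (features[20] - features[bs]).abs().max().item()
    print(f"max |features(20) - features({bs})| = {diff:.3e}")
```

Reporting the maximum absolute difference, rather than only an exact equality check, makes it easier to see how large the variation is.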

Notebook: github.com/cvenoburi/embeddings_batchsizes/blob/master/compare_embeddings_and_batch_size.ipynb (PyTorch 1.3.0, Torchvision 0.4.1a0+d94043a)

The puzzling observations start at cell 12. Here are my quick notes:

Cell 12 shows that the input tensors for images i and i+100 are the same.
Cell 13 shows that the output tensors for image i differ depending on the batch size used in the dataloader.
Cell 14 shows that the output tensors for images i and i+100 are the same for batch_size=20 and batch_size=50, whereas they differ for batch_size=180.
Cell 14 also shows that the output tensor for image i with batch_size=20 is equal to the output tensor for image i+100 with batch_size=180 (probably because only 20 images are left for the second batch of dataloader180, although batch_size was set to 180).

I think the last point is the most important one, because it suggests that the issue is not really about the nominal value of batch_size but about how many images are actually in the batch being processed.
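To make the batch composition concrete, here is a small sketch of how a DataLoader with the default drop_last=False splits 200 items for the three batch sizes. The dataset of integers and the `batch_sizes` helper are placeholders for illustration, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 200 placeholder items standing in for the 200 images.
dataset = TensorDataset(torch.arange(200))


def batch_sizes(batch_size):
    """Return the number of items in each batch the loader actually yields."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    return [len(batch) for (batch,) in loader]


for bs in (20, 50, 180):
    print(bs, batch_sizes(bs))
```

With batch_size=180 the loader yields one batch of 180 items and a final batch of 20, so only indices 180 to 199 are processed in a 20-image batch.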

Unfortunately, these seemingly subtle variations in feature values can yield different predictions at test time. I am not sure whether there is a fix for this issue, but what should the rule of thumb be?
Would it be better to always use batch_size=1 when testing a model or extracting features?
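One way to quantify how much this matters is to check how often the predicted class changes between two runs, in addition to comparing raw values. The sketch below uses hand-made logits and a hypothetical helper, `prediction_agreement`; it is an illustration of the check, not a measurement from the notebook.

```python
import torch


def prediction_agreement(logits_a, logits_b):
    """Fraction of rows whose argmax class is the same in both tensors."""
    same = logits_a.argmax(dim=1) == logits_b.argmax(dim=1)
    return same.float().mean().item()


# Made-up logits: the second row is a near tie, so a tiny shift flips the class.
logits_a = torch.tensor([[2.0, 1.0, 0.5], [0.3001, 0.3000, 0.1]])
logits_b = torch.tensor([[2.0, 1.0, 0.5], [0.3000, 0.3001, 0.1]])

print("close within 1e-3:", torch.allclose(logits_a, logits_b, atol=1e-3))
print("bitwise equal:", torch.equal(logits_a, logits_b))
print("prediction agreement:", prediction_agreement(logits_a, logits_b))
```

Two outputs can agree within a small tolerance and still give different predictions when the top classes are nearly tied, which is the situation described above.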

Thanks!