Why does the size of a batch affect the features extracted from a pre-trained model in eval mode?

Hi all,
I realized that when I use different batch_size values in torch.utils.data.DataLoader, I end up with slightly different feature vectors (which sometimes affects the model predictions), even though I use both model.eval() and torch.no_grad() while extracting features.

To reproduce this issue, I created the following test scenario. I try to extract features for 200 images, where the first 100 images are exactly the same as the last 100 (in the exact same order). I simply use the default pre-trained VGG16 network for this experiment. I test with batch sizes of 20, 50, and 180.
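To make the setup concrete, here is a minimal sketch of the experiment. A tiny randomly initialized CNN stands in for VGG16 so it runs without downloading weights, and the image size and feature width are made up for illustration; the dataset/loader structure mirrors the one described above.

```python
# Sketch of the experiment: a small CNN stands in for VGG16 (assumption,
# so the snippet is self-contained); 200 images where item i == item i+100.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

base = torch.randn(100, 3, 32, 32)       # 100 random "images"
images = torch.cat([base, base], dim=0)  # duplicated -> 200 images total
dataset = TensorDataset(images)

model = nn.Sequential(                   # stand-in for the VGG16 extractor
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model.eval()

def extract(batch_size):
    """Run all 200 images through the model with the given batch size."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    feats = []
    with torch.no_grad():
        for (batch,) in loader:
            feats.append(model(batch))
    return torch.cat(feats, dim=0)       # shape (200, 8)

f20, f50, f180 = extract(20), extract(50), extract(180)
# Any differences are tiny, so compare with a tolerance, not strict equality
print(torch.allclose(f20, f50), torch.allclose(f20, f180))
```

With the real VGG16 the feature dimension would be much larger, but the comparison logic is the same: collect the per-batch outputs in order and compare the stacked feature matrices across batch sizes.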

The puzzling observations start at cell 12. Here are my quick notes:

  • Cell 12 shows that the input tensors for images i and i+100 are the same.
  • Cell 13 shows that output tensors (obtained with different batch sizes in dataloader) for image i are different.
  • Cell 14 shows that output tensors for images i and i+100 are the same for batch_size=20 and batch_size=50 whereas the tensors differ for batch_size=180.
  • Cell 14 also shows that the output tensor for image i when batch_size=20 is equal to the output tensor for image i+100 when batch_size=180 (probably because only 20 images are left for the second batch of dataloader180, even though batch_size was set to 180).

I think the last point is the most important one, because it suggests that the issue is probably not about the configured batch_size value itself but about how many images actually end up in each processed batch.
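To illustrate why the number of items in a batch could matter at all: floating-point addition is not associative, so reducing the same numbers in different groupings can round differently, and the kernels behind conv/matmul may pick different accumulation orders for different batch shapes (this mechanism is my interpretation, not something the notebook proves). A pure-Python sketch with values chosen artificially to make the effect obvious:

```python
# Summing the SAME numbers, grouped differently, gives different results.
xs = [2.0**53, 1.0, 1.0, -2.0**53]

# One long running sum: each 1.0 is absorbed by the huge 2**53 and lost
whole = sum(xs)  # -> 0.0

# Sum in chunks of two, then combine the partial sums; in the second
# chunk, 1.0 - 2**53 is exactly representable, so one 1.0 survives
chunked = sum(sum(xs[i:i + 2]) for i in range(0, len(xs), 2))  # -> 1.0

print(whole, chunked, whole == chunked)
```

Real feature values only differ in the last few bits rather than this dramatically, but the principle is the same: changing the batch shape can change the reduction order, and with it the rounded result.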

Unfortunately, these subtle-looking variations in feature values may yield different predictions at test time. I am not sure if there is a fix for this issue, but what should the rule of thumb be?
Would it be better to always use a batch_size=1 while testing a model or extracting features?