Does batch size matter at inference?

I’m trying to evaluate a semantic segmentation model on the Pascal VOC 2012 dataset using the mIoU metric. I’m evaluating the model with different batch sizes. For example, I tried batch size = 1 and batch size = 16 and got different mIoU values. In particular, with batch size = 1 I get a lower mIoU than with batch size = 16. Does batch size affect the evaluation result?

Here is the code for the mIoU metric:

import numpy as np
import torch

def mIoU(pred_mask, mask, smooth=1e-10, n_classes=21):
    with torch.no_grad():
        # Flatten every pixel of the batch into one long vector on the CPU
        pred_mask = pred_mask.to(device="cpu").contiguous().view(-1)
        mask = mask.to(device="cpu").contiguous().view(-1)
        iou_per_class = []
        for clas in range(0, n_classes):  # loop over pixel classes
            true_class = pred_mask == clas
            true_label = mask == clas
            if true_label.long().sum().item() == 0:  # class absent from the ground truth
                iou_per_class.append(np.nan)  # ignored by np.nanmean below
            else:
                intersect = torch.logical_and(true_class, true_label).sum().float().item()
                union = torch.logical_or(true_class, true_label).sum().float().item()

                iou = (intersect + smooth) / (union + smooth)
                iou_per_class.append(iou)
        return np.nanmean(iou_per_class)

Are you comparing the single sample metric vs. the mean of the batch or are you calculating the mean of multiple samples vs. the mean of a batch?
In any case, I would probably start by comparing the results of two pre-defined samples where the mIoU is known and pre-computed and check which approach yields invalid results.
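As a rough illustration of that kind of check, here is a minimal sketch with two tiny hand-made samples whose mIoU can be worked out on paper. It is written with NumPy arrays rather than torch tensors so it runs without a model, and `miou_flat` is a hypothetical stand-in for the metric in the question (it skips classes that are absent from the ground truth, but omits the `smooth` term).

```python
import numpy as np

def miou_flat(pred, target, n_classes):
    """mIoU over all pixels passed in, skipping classes absent from `target`."""
    pred = np.asarray(pred).reshape(-1)
    target = np.asarray(target).reshape(-1)
    ious = []
    for c in range(n_classes):
        pred_c = pred == c
        target_c = target == c
        if target_c.sum() == 0:
            continue  # class not in the ground truth: left out of the mean
        intersect = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        ious.append(intersect / union)
    return float(np.mean(ious))

# Sample A: class 0 -> 1/2, class 1 -> 2/3, so mIoU = 7/12
target_a = np.array([0, 0, 1, 1])
pred_a = np.array([0, 1, 1, 1])

# Sample B: only class 0 is in the ground truth -> 3/4, so mIoU = 3/4
target_b = np.array([0, 0, 0, 0])
pred_b = np.array([0, 0, 0, 1])

print(miou_flat(pred_a, target_a, n_classes=2))  # hand-computed: 7/12
print(miou_flat(pred_b, target_b, n_classes=2))  # hand-computed: 3/4
```

If the real metric disagrees with hand-computed values like these on single samples, the bug is in the metric itself. If it agrees, the discrepancy more likely comes from how the scores are averaged.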

I actually compute the mean of the per-sample mIoUs when I set batch size = 1. When I set batch size = 16, on the other hand, I compute the mIoU over each batch and then take the mean over all batches. These two approaches give me different results. Here is the code.

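def evaluation(dataloader):
  miou = []
  with torch.no_grad():
    for image, label in dataloader:
      # The right-hand sides of these two assignments are missing in the source;
      # the batch is used as loaded here (it was presumably moved to the model's device).
      X_batch = image
      y_batch = label
      preds = deeplab(X_batch)
      _ , predictions = torch.max(preds, dim=1)  # class index with the highest score per pixel
      iou_score = mIoU(predictions , y_batch)
      miou.append(iou_score)  # one mIoU value per batch
  score = np.array(miou)

  return np.mean(score)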
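The two averaging schemes described above are not expected to agree, because the metric flattens all pixels of a batch and skips classes that are absent from that batch's ground truth. The sketch below reproduces the effect on a made-up four-sample "dataset" in NumPy. All names (`class_counts`, `miou_from_counts`, `mean_of_batch_mious`, `accumulated_miou`) are hypothetical. The last function is one possible way to get a score that does not depend on batch size: summing per-class intersections and unions over the whole dataset before dividing. It is an assumption on my part, not something proposed in the thread.

```python
import numpy as np

N_CLASSES = 2

# Toy dataset: four 4-pixel samples (rows) of predictions and ground truth.
preds = np.array([[0, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0], [0, 1, 0, 1]])
targets = np.array([[0, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]])

def class_counts(pred, target):
    """Per-class intersection, union and ground-truth pixel counts for one chunk."""
    pred, target = pred.reshape(-1), target.reshape(-1)
    inter = np.array([np.sum((pred == c) & (target == c)) for c in range(N_CLASSES)])
    union = np.array([np.sum((pred == c) | (target == c)) for c in range(N_CLASSES)])
    gt = np.array([np.sum(target == c) for c in range(N_CLASSES)])
    return inter, union, gt

def miou_from_counts(inter, union, gt):
    """Mean IoU over the classes that occur in the ground truth."""
    present = gt > 0
    return float(np.mean(inter[present] / union[present]))

def mean_of_batch_mious(batch_size):
    """Scheme from the thread: mIoU per batch, then the mean over batches."""
    scores = []
    for start in range(0, len(preds), batch_size):
        chunk = slice(start, start + batch_size)
        scores.append(miou_from_counts(*class_counts(preds[chunk], targets[chunk])))
    return float(np.mean(scores))

def accumulated_miou(batch_size):
    """Alternative: sum the counts over all batches, divide once at the end."""
    total = np.zeros((3, N_CLASSES))  # rows: intersection, union, ground-truth pixels
    for start in range(0, len(preds), batch_size):
        chunk = slice(start, start + batch_size)
        total += np.stack(class_counts(preds[chunk], targets[chunk]))
    return miou_from_counts(*total)

for bs in (1, 2, 4):
    print(bs, round(mean_of_batch_mious(bs), 4), round(accumulated_miou(bs), 4))
```

On this toy data the mean of batch mIoUs changes with the batch size, while the accumulated version stays the same. Whether the smaller batch size scores higher or lower depends on the data, so the direction seen here need not match the one observed on Pascal VOC.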