Does batch size matter at inference?

I’m trying to evaluate a semantic segmentation model on the Pascal VOC 2012 dataset using the mIoU metric. When I evaluate with different batch sizes, for example batch size = 1 and batch size = 16, I get different mIoU values. In particular, batch size = 1 gives a lower mIoU than batch size = 16. Does batch size affect performance at evaluation time?

Here is the code for the mIoU metric:

import numpy as np
import torch

def mIoU(pred_mask, mask, smooth=1e-10, n_classes=21):
    # Both tensors are flattened, so pixels from every image in the batch are pooled.
    with torch.no_grad():
        pred_mask = pred_mask.to("cpu").contiguous().view(-1)
        mask = mask.to("cpu").contiguous().view(-1)

        iou_per_class = []
        for clas in range(n_classes):  # loop over pixel classes
            pred_class = pred_mask == clas  # pixels predicted as this class
            true_label = mask == clas       # pixels labelled as this class

            if true_label.long().sum().item() == 0:  # class absent from the labels
                iou_per_class.append(np.nan)
            else:
                intersect = torch.logical_and(pred_class, true_label).sum().float().item()
                union = torch.logical_or(pred_class, true_label).sum().float().item()

                iou = (intersect + smooth) / (union + smooth)
                iou_per_class.append(iou)
        # Average over the classes present in the labels; NaN entries are skipped.
        return np.nanmean(iou_per_class)

Are you comparing the single sample metric vs. the mean of the batch or are you calculating the mean of multiple samples vs. the mean of a batch?
In any case, I would probably start by comparing the results of two pre-defined samples where the mIoU is known and pre-computed and check which approach yields invalid results.
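
To make that check concrete, here is a minimal sketch with two tiny hand-made samples whose mIoU can be worked out on paper. `miou_flat` is a hypothetical NumPy re-implementation of the `mIoU` function above (same logic, no torch), and the two-class, four-pixel samples are made up purely for illustration.

```python
import numpy as np

def miou_flat(pred, label, n_classes, smooth=1e-10):
    """Hypothetical NumPy port of mIoU above: pools all pixels it is given."""
    pred = np.asarray(pred).ravel()
    label = np.asarray(label).ravel()
    ious = []
    for c in range(n_classes):
        p, t = pred == c, label == c
        if t.sum() == 0:          # class absent from the labels
            ious.append(np.nan)
            continue
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        ious.append((inter + smooth) / (union + smooth))
    return np.nanmean(ious)

# Sample A: class 0 IoU = 2/3, class 1 IoU = 1/2 -> mIoU = 7/12
label_a, pred_a = [0, 0, 1, 1], [0, 0, 0, 1]
# Sample B: class 0 IoU = 3/4, class 1 absent    -> mIoU = 3/4
label_b, pred_b = [0, 0, 0, 0], [0, 0, 0, 1]

# "Batch size 1": average the two per-sample scores -> 2/3
per_sample = np.mean([miou_flat(pred_a, label_a, 2),
                      miou_flat(pred_b, label_b, 2)])

# "Batch size 2": pool both samples, then score once -> 11/21
batched = miou_flat(pred_a + pred_b, label_a + label_b, 2)

print(per_sample, batched)
```

The two numbers differ even though the predictions are identical, so in this toy case the gap comes from the way scores are aggregated and not from the model.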

With batch size = 1, I compute the mIoU of each sample and then take the mean over all samples. With batch size = 16, I compute the mIoU over each batch and then take the mean over all batches. These two approaches give me different results. Here is the code:

def evaluation(dataloader):
    miou = []
    deeplab.load_state_dict(torch.load("/content/drive/MyDrive/best_model1"))
    deeplab.eval()
    with torch.no_grad():
        for image, label in dataloader:
            X_batch = image.to(device).float()
            y_batch = label.to(device)
            preds = deeplab(X_batch)
            predictions = torch.argmax(preds, dim=1)  # per-pixel class with the highest score
            iou_score = mIoU(predictions, y_batch)    # one score per batch, not per image
            miou.append(iou_score)

    # Unweighted mean over batches
    return np.mean(miou)

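If your batch sizes are not all the same (e.g., your final batch isn’t the same size as the rest), then this will not yield the same results as separately calculating the metric for each item. Even with equal batch sizes the two numbers can differ: mIoU pools the pixels of the whole batch before dividing, and IoU is a ratio, so the IoU of a pooled batch is generally not the mean of the per-image IoUs. A class that is missing from one image is also skipped for that image at batch size 1 but may be counted once the image is pooled with others.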
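
One way to get a score that does not depend on batch size is to accumulate the per-class intersection and union over the whole dataset and divide only once at the end. The sketch below assumes that dataset-level mIoU is the quantity you want; `dataset_miou` is a hypothetical helper written in NumPy, and the toy samples are made up for illustration.

```python
import numpy as np

def dataset_miou(batches, n_classes):
    """Accumulate per-class counts over all batches, then divide once."""
    inter = np.zeros(n_classes)
    union = np.zeros(n_classes)
    support = np.zeros(n_classes)  # labelled pixels per class
    for pred, label in batches:
        pred = np.asarray(pred).ravel()
        label = np.asarray(label).ravel()
        for c in range(n_classes):
            p, t = pred == c, label == c
            inter[c] += np.logical_and(p, t).sum()
            union[c] += np.logical_or(p, t).sum()
            support[c] += t.sum()
    present = support > 0  # skip classes that never appear in the labels
    return float(np.mean(inter[present] / union[present]))

label_a, pred_a = [0, 0, 1, 1], [0, 0, 0, 1]
label_b, pred_b = [0, 0, 0, 0], [0, 0, 0, 1]

# Same data fed as two batches of one sample, or as one batch of two samples.
score_bs1 = dataset_miou([(pred_a, label_a), (pred_b, label_b)], n_classes=2)
score_bs2 = dataset_miou([(pred_a + pred_b, label_a + label_b)], n_classes=2)

print(score_bs1, score_bs2)  # identical regardless of how the data is batched
```

Because the counts are plain sums, splitting the data into batches of any size (including an uneven final batch) leaves the result unchanged. In the evaluation loop above, the same idea would mean returning the per-class counts from each batch instead of a finished score.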