R-CNN predictions change with different batch sizes

Even with model.eval(), I get different predictions when changing the batch size. I found this issue while working on a project with Faster R-CNN and my own data, but I can replicate it in the "TorchVision Object Detection Finetuning Tutorial" (TorchVision Object Detection Finetuning Tutorial — PyTorch Tutorials 1.9.0+cu102 documentation), which uses Mask R-CNN.

Steps to replicate the issue:

  1. Open the Colab version of the tutorial (Google Colaboratory)
  2. Run all cells
  3. Insert a new cell at the bottom with the code below and run it:
def get_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')

def predict(model, image_tensors):
    """
    Generate the model's predictions (bounding boxes, scores and labels)
    for a batch of image tensors.
    """
    device = get_device()
    model.eval()
    with torch.no_grad():
        predictions = model([x.to(device) for x in image_tensors])
    return predictions

def generate_preds(model, batch_size):
    """
    Create a dataloader for the test dataset with a configurable batch size.
    Generate predictions and return a list with one prediction per sample.
    """
    dataloader = torch.utils.data.DataLoader(
        dataset_test, batch_size=batch_size, shuffle=False, num_workers=2,
        collate_fn=utils.collate_fn)
    all_preds = []
    for image_tensors, targets in dataloader:
        all_preds += predict(model, image_tensors)
    return all_preds

# Generate two sets of predictions; the only change is the batch size
preds1 = generate_preds(model, 1)
preds8 = generate_preds(model, 8)
assert len(preds1) == len(preds8)

# Inspect the first five samples:
for x in range(5):
    print(f"\nSample {x}:")
    print("-Boxes")
    print(preds1[x]["boxes"])
    print(preds8[x]["boxes"])
    print("-Scores")
    print(preds1[x]["scores"])
    print(preds8[x]["scores"])
    print("-Labels")
    print(preds1[x]["labels"])
    print(preds8[x]["labels"])

The code above generates two sets of predictions for the test set: the first with batch size 1 and the second with batch size 8. This is the output I get when I run that cell:

Sample 0:
-Boxes
tensor([[ 61.2343,  37.6461, 197.8525, 325.6508],
        [276.4769,  23.9664, 290.8987,  73.1913]], device='cuda:0')
tensor([[ 59.1616,  36.3829, 201.7858, 331.4406],
        [276.4261,  23.7988, 290.8489,  72.8123],
        [ 81.2091,  37.6342, 192.8113, 217.8009]], device='cuda:0')
-Scores
tensor([0.9989, 0.5048], device='cuda:0')
tensor([0.9988, 0.6410, 0.1294], device='cuda:0')
-Labels
tensor([1, 1], device='cuda:0')
tensor([1, 1, 1], device='cuda:0')

Sample 1:
-Boxes
tensor([[ 90.7305,  60.1291, 232.4859, 341.7854],
        [245.7694,  56.3715, 305.2585, 349.5301],
        [243.0723,  16.5198, 360.2888, 351.5983]], device='cuda:0')
tensor([[ 91.1201,  59.8146, 233.0968, 342.2685],
        [245.7369,  56.6024, 305.2173, 349.3939],
        [241.1119,  32.6983, 362.4162, 346.0358]], device='cuda:0')
-Scores
tensor([0.9976, 0.9119, 0.1945], device='cuda:0')
tensor([0.9975, 0.9128, 0.1207], device='cuda:0')
-Labels
tensor([1, 1, 1], device='cuda:0')
tensor([1, 1, 1], device='cuda:0')

Sample 2:
-Boxes
tensor([[281.1774,  53.5141, 428.7436, 330.3915],
        [139.6456,  23.7953, 264.7703, 330.2114]], device='cuda:0')
tensor([[281.7463,  53.2942, 429.3290, 327.9640],
        [138.7147,  23.8612, 264.6823, 332.3202]], device='cuda:0')
-Scores
tensor([0.9969, 0.9947], device='cuda:0')
tensor([0.9968, 0.9945], device='cuda:0')
-Labels
tensor([1, 1], device='cuda:0')
tensor([1, 1], device='cuda:0')

Sample 3:
-Boxes
tensor([[175.3683,  34.3320, 289.3029, 306.8307],
        [ 76.7871,  15.4444, 187.0855, 299.1662],
        [  0.0000,  45.9045,  51.3796, 222.0583],
        [319.1224,  53.0593, 377.1693, 232.7251],
        [260.2587,  55.8976, 309.0191, 229.4261],
        [ 70.2029,  27.2173, 126.4584, 234.3767],
        [ 38.0638,  55.5370,  65.4132, 164.1965],
        [ 98.7189,  91.5356, 172.5915, 295.5404],
        [ 70.1933,  56.1804, 103.6161, 218.4743]], device='cuda:0')
tensor([[175.1848,  36.0377, 288.8358, 305.3505],
        [ 76.8171,  15.7485, 187.4645, 299.5779],
        [  0.0000,  45.9045,  51.3796, 222.0582],
        [319.1060,  53.0140, 377.3391, 232.7926],
        [260.2587,  55.8976, 309.0191, 229.4261],
        [ 70.2030,  27.2173, 126.4584, 234.3767],
        [ 38.0638,  55.5370,  65.4132, 164.1965],
        [ 70.1933,  56.1804, 103.6161, 218.4743]], device='cuda:0')
-Scores
tensor([0.9968, 0.9959, 0.9942, 0.9937, 0.9271, 0.8133, 0.4273, 0.1163, 0.0884],
       device='cuda:0')
tensor([0.9974, 0.9965, 0.9942, 0.9937, 0.9271, 0.8133, 0.4273, 0.0884],
       device='cuda:0')
-Labels
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0')
tensor([1, 1, 1, 1, 1, 1, 1, 1], device='cuda:0')

Sample 4:
-Boxes
tensor([[318.0241,  60.4089, 450.3268, 348.4254],
        [167.0622,  27.6761, 242.5035, 316.6244],
        [221.8452,  26.9947, 310.0547, 291.2983],
        [295.6860,  23.4690, 379.8831, 260.1526],
        [140.3205,  44.4713, 223.6427, 281.9173],
        [141.0462,  24.9851, 313.7406, 301.5022],
        [252.8210,  28.4908, 358.8223, 261.0169]], device='cuda:0')
tensor([[317.8378,  63.2861, 450.5063, 350.6856],
        [167.0629,  27.6768, 242.5045, 316.6241],
        [221.8452,  26.9948, 310.0548, 291.2983],
        [295.6860,  23.4690, 379.8831, 260.1525],
        [142.1777,  24.9079, 313.1906, 302.9822],
        [140.3205,  44.4713, 223.6428, 281.9174],
        [252.8209,  28.4907, 358.8222, 261.0172]], device='cuda:0')
-Scores
tensor([0.9969, 0.9948, 0.9910, 0.9733, 0.1821, 0.1696, 0.0668],
       device='cuda:0')
tensor([0.9968, 0.9948, 0.9910, 0.9733, 0.1832, 0.1821, 0.0668],
       device='cuda:0')
-Labels
tensor([1, 1, 1, 1, 1, 1, 1], device='cuda:0')
tensor([1, 1, 1, 1, 1, 1, 1], device='cuda:0')

As far as I know, the predictions for each sample should be identical regardless of the batch size. However, there are differences in the scores, the bounding boxes, and even the number of detections returned…

Any help would be appreciated 🙂

I got an answer on GitHub (R-CNN predictions change with different batch sizes · Issue #4257 · pytorch/vision · GitHub):

This is expected and is due to padding the differently-sized input images with zeros so that all images have the same size (for batching).

If you crop all images to have the same size before feeding them to the model, the batch size shouldn’t influence the predictions.
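To sanity-check this explanation, here is a minimal sketch of my own (not from the tutorial; it uses a pretrained Faster R-CNN and random tensors instead of the finetuned model and dataset) that feeds same-sized images through the model with batch size 1 and batch size 2 and compares the outputs:

import torch
import torchvision

# Sketch (mine, not from the tutorial): when all images already share the
# same spatial size, GeneralizedRCNNTransform pads them identically whether
# they are batched together or passed one at a time, so the batch size should
# no longer change the predictions (up to small floating-point differences).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval().to(device)

# Two images with identical dimensions; real images of equal size from the
# dataset would be a more meaningful test than random noise.
torch.manual_seed(0)
images = [torch.rand(3, 480, 640, device=device) for _ in range(2)]

with torch.no_grad():
    single = [model([img])[0] for img in images]  # batch size 1
    batched = model(images)                       # batch size 2

for s, b in zip(single, batched):
    print(s['boxes'].shape == b['boxes'].shape and
          torch.allclose(s['boxes'], b['boxes'], atol=1e-4))

With same-sized inputs the internal transform produces the same padded tensors for both batch sizes, so any remaining differences should be ordinary floating-point noise rather than the extra or missing detections shown above.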