Results vary by batch size for MobileNet_V2 model on GPU

I’ve followed the steps from the reproducibility docs and put the model in eval mode, so batch norm statistics should not affect the results. I still get different outputs for different batch sizes from the same model on the same input. This does not happen when I run on the CPU. Is there anything I’m missing?

import os
import random

import numpy as np
import torch
from torch import hub
from torchvision.models import MobileNet_V2_Weights

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

np.random.seed(0)
random.seed(0)
torch.manual_seed(0)

# Force deterministic kernel selection, as described in the reproducibility docs
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)

model = hub.load(
    "pytorch/vision:v0.6.0",
    "mobilenet_v2",
    weights=MobileNet_V2_Weights.IMAGENET1K_V1,
    verbose=False,
).to("cuda:0").eval()

_input = torch.ones(size=(4, 3, 224, 224), device="cuda:0")

# Forward pass on the full batch of 4, then on the first 2 samples of the same batch
output = model(_input)
half_output = model(_input[:2])

output, half_output
(tensor([[-0.2532,  0.0923, -0.4648,  ..., -1.6945, -0.1472,  0.9123],
...
        device='cuda:0', grad_fn=<AddmmBackward0>),
 tensor([[-0.2558,  0.0903, -0.4674,  ..., -1.6931, -0.1455,  0.9128],
...
        device='cuda:0', grad_fn=<AddmmBackward0>))
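
The differences show up around the third decimal place in the printed values. One way to quantify them, as a minimal sketch reusing output and half_output from the snippet above, is to compare the two overlapping rows directly:

# Compare the first two rows of the batch-4 output with the batch-2 output;
# the max absolute/relative difference quantifies how far the two runs diverge.
diff = (output[:2] - half_output).abs()
print("max abs diff:", diff.max().item())
print("max rel diff:", (diff / half_output.abs().clamp_min(1e-12)).max().item())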

I also saw this: Different outputs when using different batch size (only on cuda) - #2 by ptrblck

Does this constitute a different workload?

Yes, different batch sizes can constitute different workloads. With a different input shape, cuBLAS/cuDNN may select different kernels, and those kernels can accumulate floating-point values in a different order, so the outputs can differ within floating-point precision even though each run is individually deterministic. If you want to verify that this is happening, you could check which kernels are actually executed in each case, for example by running nsys nvprof python mymodel.py and comparing the profiler output for the two batch sizes.
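
If you prefer to stay inside Python, torch.profiler can give a similar view. Here is a rough sketch (assuming the model and _input defined above) that lists the CUDA kernels launched for each batch size so the two runs can be compared:

from torch.profiler import profile, ProfilerActivity

def show_cuda_kernels(batch):
    # Record the CUDA kernels launched during one forward pass.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(batch)
    # Kernels sorted by total GPU time; differing kernel names between the
    # two batch sizes indicate that different implementations were picked.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

show_cuda_kernels(_input)       # batch size 4
show_cuda_kernels(_input[:2])   # batch size 2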