Would batch size/order affect the behavior of BatchNorm or any other layer when in eval mode?

I have a model trained with batch size 16, and when I evaluate at batch size 16, I get my expected results. When I change the batch size during evaluation, results get worse as the batch size decreases.

Likewise, shuffling the DataLoader during evaluation improves results. I assume it’s because shuffled batches are more balanced than the fixed file order, but I don’t know why that would affect results at all.

I think this is an error in my code, but I’d like to ask if I’m misunderstanding the intended behavior before I start debugging. I’m running on Google Colab and using the same network as the one here, if that’s relevant.

No, the batch size should not have any effect on BatchNorm layers during eval(), aside from small numerical differences caused by limited floating-point precision and a different order of operations.
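You can verify this on a single BatchNorm2d layer in isolation (a minimal sketch, independent of your model): in eval() mode the layer normalizes with its stored running statistics, so each sample's output does not depend on what else is in the batch.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
bn.eval()  # use running_mean / running_var instead of batch statistics

x = torch.randn(4, 3, 8, 8)
out_full = bn(x)        # forward the whole batch
out_single = bn(x[0:1]) # forward the first sample alone

# identical, since eval-mode normalization is per-sample
print(torch.allclose(out_full[0], out_single[0]))
```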
Your model also works for me and doesn’t show any difference:

import torch

model = UNet(3, 10)
model.eval()
x = torch.randn(10, 3, 224, 224)

# create reference output with the full batch
out = model(x)

# iterate all samples separately and compare to the reference
for idx, x_ in enumerate(x):
    x_ = x_.unsqueeze(0)
    out_single = model(x_)
    print((out_single.squeeze(0) - out[idx]).abs().max())
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)
# tensor(0., grad_fn=<MaxBackward1>)

Your mention of precision errors was very helpful. I found that the code I copied was storing the mask values in a long tensor, so the errors were quite large.
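For anyone hitting something similar: a long tensor silently truncates fractional values toward zero, so any non-integer mask content is lost on conversion. A minimal sketch (the values here are hypothetical, not my actual data):

```python
import torch

# Fractional values are truncated, not rounded, when cast to long
mask_float = torch.tensor([0.25, 0.5, 0.75, 1.0])
mask_long = mask_float.to(torch.long)

print(mask_long)  # the fractional entries all collapse to 0
```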