Why is the output different when batching?

I thought that for the same input, we should get the same output, no matter the batch size.

But I tried a small experiment, and it does not confirm my intuition…


I define a simple network:

import torch
import torch.nn as nn

smol = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU()
)

Then I try 2 inputs: a batch containing only sample x, and a batch containing x and another sample. I expect the output for x to be the same in both cases, but it isn't:

x = torch.rand([1,  256])                           # Batch with only x 
xy = torch.cat([x, torch.rand([1,  256])], dim=0)   # Batch with x and another sample

out_alone = smol(x)[0]
out_batch = smol(xy)[0]

assert torch.equal(out_alone, out_batch), f"\n{out_alone[:15]}\n{out_batch[:15]}"

AssertionError:
tensor([0.0870, 0.0757, 0.0076, 0.0000, 0.0000, 0.0000, 0.0032, 0.0976, 0.0000,
0.0648, 0.2508, 0.0000, 0.1737, 0.0546, 0.0043],
grad_fn=<SliceBackward0>)
tensor([0.0870, 0.0757, 0.0076, 0.0000, 0.0000, 0.0000, 0.0032, 0.0976, 0.0000,
0.0648, 0.2508, 0.0000, 0.1737, 0.0546, 0.0043],
grad_fn=<SliceBackward0>)

Colab notebook to reproduce it
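
For reference, the size of the mismatch can be inspected directly (a small check reusing the tensors defined above; the exact value varies from run to run, but it is typically around float32 precision):

diff = (out_alone - out_batch).abs().max()
print(diff)   # something on the order of 1e-7, i.e. float32 rounding noise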


Even if it seems to be only a precision difference, why is the output not exactly the same?

Why does the output depend on the batch size when the input is the same?

Different algorithms can be used for different input shapes, which creates these small errors due to the limited numerical precision.
E.g. a different order of operations will also create such numerical differences, as seen here:

x = torch.randn(10, 10, 10)
s1 = x.sum()
s2 = x.sum(0).sum(0).sum(0)
print((s1 - s2).abs().max())
> tensor(9.5367e-07)

and you cannot assume bitwise-identical results when different algorithms are used.
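
If the goal is only to check that the two outputs agree up to floating-point noise, a tolerance-based comparison is the usual approach (a minimal sketch using torch.allclose with its default tolerances, reusing the tensors from the question):

# torch.equal requires bitwise equality, torch.allclose allows a small rtol/atol
print(torch.equal(out_alone, out_batch))     # may be False because of the effects above
print(torch.allclose(out_alone, out_batch))  # True if the difference is within tolerance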
