Sub results different for tensor shape 1 and 32

Why does `torch.sub` produce different results for the same data when the shape is 1 vs. 32?

import torch

def fn(shape):
    alpha = -2.278090e-04
    x = torch.empty(shape, dtype=torch.float16).fill_(0.004044)
    y = torch.empty(shape, dtype=torch.float16).fill_(-17.39)

    print(torch.sub(x, y, alpha=alpha)[0])

fn(32)
fn(1)

prints

tensor(8.1837e-05, dtype=torch.float16)
tensor(8.0109e-05, dtype=torch.float16)
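As a side note (not from the original thread, just a common workaround), one way to shrink such discrepancies is to upcast to `float32`, do the subtraction there, and cast back to `float16`, so that only a single `float16` rounding happens at the end:

```python
import torch

alpha = -2.278090e-04
x = torch.empty(32, dtype=torch.float16).fill_(0.004044)
y = torch.empty(32, dtype=torch.float16).fill_(-17.39)

# Compute in float32: the intermediate product alpha * y is represented
# far more accurately, and the result is rounded to float16 only once.
res = torch.sub(x.float(), y.float(), alpha=alpha).half()
print(res[0])
```

This does not remove the rounding inherent to storing the inputs in `float16`, but it makes the result much less sensitive to which kernel the elementwise op dispatches to.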

This is caused by the limited floating point precision and a different order of operations depending on the algorithm used, as seen e.g. in this simple example, where summing all elements at once vs. dimension by dimension gives slightly different results:

x = torch.randn(100, 100)
s1 = x.sum()
s2 = x.sum(0).sum(0)
print(s1 - s2)
# tensor(1.5259e-05)
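The underlying effect is simply that floating point addition is not associative, which you can see even with plain Python floats:

```python
# Changing the grouping of the same three values changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

Any algorithm that accumulates or evaluates terms in a different order can therefore produce a slightly different rounded result from the same data.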

For an elementwise op like this, could you elaborate on

different order of operations

I was wondering if this is due to vectorized instructions used for shape > 1.

Yes, maybe. You could use a profiler to check which kernel is called for your exact use case.
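A sketch of that check with `torch.profiler` (it reports the dispatched ATen operators, not the exact SIMD path taken inside the kernel, so it mainly confirms which op runs on which device):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def fn(shape):
    alpha = -2.278090e-04
    x = torch.empty(shape, dtype=torch.float16).fill_(0.004044)
    y = torch.empty(shape, dtype=torch.float16).fill_(-17.39)
    return torch.sub(x, y, alpha=alpha)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    fn(32)
    fn(1)

# Lists the recorded ops (e.g. aten::sub) with their CPU timings;
# lower-level vectorization decisions are not visible here.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

To see whether the vectorized or the scalar code path is actually taken, you would have to look at the kernel source (e.g. the `TensorIterator`-based CPU loops) or inspect it with a native profiler.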