Inconsistent relative matmul times across machines

Hey folks, I was trying to benchmark the speedup from a low-rank matrix factorization and got wildly different results across machines.

import timeit
import torch

# input vector
x = torch.rand(1000)

# full matrix: 5000 x 1000
w = torch.rand([5000, 1000])

# rank-100 factor pair; a @ b has the same 5000 x 1000 shape as w
a = torch.rand([5000, 100])
b = torch.rand([100, 1000])

print("wx   ", timeit.timeit(lambda: w @ x, number=10000))
print("a(bx)", timeit.timeit(lambda: a @ (b @ x), number=10000))

On machine A:

wx    1.0115966480225325
a(bx) 0.8654864309355617

On machine B:

wx    2.858995377959218
a(bx) 0.16533887601690367

On machine C:

wx    2.5966870239935815
a(bx) 2.271036416757852

On machine D:

wx    1.2024586703628302
a(bx) 4.790976291522384

These results are consistent and reproducible per machine.
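
One variable I haven't controlled for is PyTorch's intra-op threading. If the machines ship different BLAS backends or thread-pool sizes (an assumption on my part, not something I've verified), a single-threaded variant of the same benchmark might behave more consistently:

import timeit
import torch

# Pin the intra-op thread pool to one thread so multithreaded BLAS
# kernels can't parallelize the two matmuls differently per machine.
torch.set_num_threads(1)

x = torch.rand(1000)
w = torch.rand([5000, 1000])
a = torch.rand([5000, 100])
b = torch.rand([100, 1000])

print("wx   ", timeit.timeit(lambda: w @ x, number=10000))
print("a(bx)", timeit.timeit(lambda: a @ (b @ x), number=10000))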

I care about the relative timings within each machine, not the absolute numbers. One machine may be faster overall, but on any given machine the two timings should still scale with the amount of work each matmul does.
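
Back-of-the-envelope, the factored path does roughly 8x fewer multiply-adds per matvec, so naively I'd expect a(bx) to win everywhere:

# rough multiply-add counts per matvec, ignoring constant factors
direct   = 5000 * 1000                # w @ x
factored = 100 * 1000 + 5000 * 100    # b @ x, then a @ (b @ x)
print(direct / factored)              # ~8.3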

The measurements are not affected by the PyTorch version; I tried both 1.5.0 and 1.12.1.
The dtype is float32 and the device is CPU on all the machines.
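
In case it helps, I can post the output of the following on each machine; my (unverified) guess is that the BLAS backend or default thread count differs between them:

import torch

print(torch.__version__)          # PyTorch version on this machine
print(torch.get_num_threads())    # size of the intra-op thread pool
print(torch.__config__.show())    # build config, including the BLAS backend (MKL, OpenBLAS, ...)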

Any idea why this is so machine dependent?

Thanks in advance!