Matrix multiplication in torch is slower than in numpy for small matrices

torch's effective matrix-multiplication speed improves as the number of repeated multiplications grows, but for small matrices it stays slower than numpy

from time import perf_counter

import numpy as np
import torch
import matplotlib.pyplot as plt


def timer(func, reps, *args, **kwargs):
    # Total wall-clock time for `reps` calls to func(*args, **kwargs)
    s = perf_counter()
    for _ in range(reps):
        func(*args, **kwargs)
    e = perf_counter()

    return e - s
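As a side note, the same helper can estimate the pure Python loop/call overhead baked into every measurement; the `noop` function below is illustrative and not part of the original benchmark:

    from time import perf_counter

    def timer(func, reps, *args, **kwargs):
        s = perf_counter()
        for _ in range(reps):
            func(*args, **kwargs)
        e = perf_counter()
        return e - s

    def noop():
        # Empty call: measures only the loop and function-call cost.
        pass

    baseline = timer(noop, 10**5)
    # `baseline` is small but nonzero; it could be subtracted from the
    # real timings to isolate the matmul cost itself.
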

y = list(range(16))
t = torch.Tensor(y * 16).reshape((16, 16))
u = torch.Tensor(y)

v = t.numpy()
w = u.numpy()

n = 8
time_torch_cpu = [timer(torch.matmul, 10**i, t, u) for i in range(1, n)]
time_numpy = [timer(np.matmul, 10**i, v, w) for i in range(1, n)]

x = list(range(1, n))

plt.plot(x, time_numpy, color='red', label='numpy')
plt.plot(x, time_torch_cpu, color='blue', label='pytorch cpu')
plt.ylabel('time (seconds)')
plt.xlabel('number of multiplications (10^i)')
plt.legend()
plt.show()


This might be expected, as PyTorch most likely adds more per-call overhead than NumPy because of Autograd bookkeeping and operator dispatching.


Even when I use torch.inference_mode(), there is only a small improvement. Is there any way to further increase the speed?
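For completeness, this is roughly how the inference_mode timing was taken (a sketch, assuming the context manager simply wraps the timed loop):

    from time import perf_counter

    import torch

    def timer(func, reps, *args, **kwargs):
        s = perf_counter()
        for _ in range(reps):
            func(*args, **kwargs)
        e = perf_counter()
        return e - s

    t = torch.Tensor(list(range(16)) * 16).reshape((16, 16))
    u = torch.Tensor(list(range(16)))

    # Disabling autograd tracking removes some per-call bookkeeping,
    # but dispatcher overhead still dominates for 16x16 inputs.
    with torch.inference_mode():
        elapsed = timer(torch.matmul, 10**3, t, u)
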