These problem sizes are small enough that they are essentially “free” compared to the cost of launching the kernels and dispatching from Python.
For example, on an A6000, we need around 1e5 iterations to even get a stable measurement of the time per iteration:
# cat matmul.py
import time
import torch

iters = 100000
a = torch.randn(10, 10, device='cuda')
b = torch.randn(100, 10, device='cuda')

torch.cuda.synchronize()  # make sure setup work is done before timing
t1 = time.perf_counter()
for _ in range(iters):
    torch.matmul(a, a)
torch.cuda.synchronize()  # wait for all queued kernels to finish
t2 = time.perf_counter()
print(f"10,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")

torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
    torch.matmul(b, a)
torch.cuda.synchronize()
t2 = time.perf_counter()
print(f"100,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")
# python matmul.py
10,10 x 10,10 took 0.6885399222373962, 6.885399222373963e-06 per iter
100,10 x 10,10 took 0.6861815741285682, 6.861815741285682e-06 per iter
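Both shapes come out at the same ~6.9 µs per iteration, i.e. the measurement is dominated by per-call launch/dispatch overhead rather than by the matmul itself. As an aside, here is a minimal sketch of the same measurement using torch.utils.benchmark, which takes care of the CUDA synchronization and warmup bookkeeping (shapes as above):

import torch
from torch.utils import benchmark

a = torch.randn(10, 10, device='cuda')
b = torch.randn(100, 10, device='cuda')

for label, stmt in [("10,10 x 10,10", "torch.matmul(a, a)"),
                    ("100,10 x 10,10", "torch.matmul(b, a)")]:
    # Timer synchronizes around CUDA work, so no manual sync is needed
    t = benchmark.Timer(stmt=stmt, globals={'torch': torch, 'a': a, 'b': b})
    print(label, t.timeit(100000))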
I would check whether you see the same behavior on a larger model or with a greater difference in input sizes, as in the sketch below.
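This is a minimal follow-up along those lines; the 4096x4096 size (and the file name) is my own pick for illustration, not anything from the original script:

# cat bigmatmul.py
import time
import torch

iters = 100
c = torch.randn(4096, 4096, device='cuda')

torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
    torch.matmul(c, c)
torch.cuda.synchronize()
t2 = time.perf_counter()
# At this size the matmul kernel itself dominates, so the per-iteration
# time should sit well above the ~7 µs launch/dispatch floor seen above.
print(f"4096,4096 x 4096,4096 took {t2-t1}, {(t2-t1)/iters} per iter")

If the per-iteration time scales with problem size here but not in the small cases, the flat numbers above really are just Python dispatch and kernel launch overhead.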