Running time and BP under condition

These problem sizes are small enough that the matmuls themselves are essentially “free” compared to the cost of launching the kernels and dispatching from Python.
For example, on an A6000, it takes around 1e5 iterations to even get a stable measurement of the time per iteration:

# cat matmul.py
import time
import torch

iters = 100000

a = torch.randn(10, 10, device='cuda')
b = torch.randn(100, 10, device='cuda')

# Drain any pending GPU work before starting the clock.
torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
  torch.matmul(a, a)
# Kernel launches are asynchronous; synchronize so the measurement
# includes the GPU work itself, not just the launch calls.
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"10,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")

torch.cuda.synchronize()
t1 = time.perf_counter()
for _ in range(iters):
  torch.matmul(b, a)
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"100,10 x 10,10 took {t2-t1}, {(t2-t1)/iters} per iter")
# python matmul.py
10,10 x 10,10 took 0.6885399222373962, 6.885399222373963e-06 per iter
100,10 x 10,10 took 0.6861815741285682, 6.861815741285682e-06 per iter
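As a cross-check on the hand-rolled loop, torch.utils.benchmark handles CUDA synchronization and warm-up automatically. A minimal sketch (it falls back to CPU when no GPU is present, so the absolute numbers will differ from the A6000 figures above):

```python
import torch
import torch.utils.benchmark as benchmark

# The launch-overhead effect only shows on a GPU, but the script
# also runs on CPU so it can be tried anywhere.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

a = torch.randn(10, 10, device=device)
b = torch.randn(100, 10, device=device)

# Timer inserts synchronization and warm-up runs for us.
for label, stmt in [("10,10 x 10,10", "torch.matmul(a, a)"),
                    ("100,10 x 10,10", "torch.matmul(b, a)")]:
    m = benchmark.Timer(stmt=stmt,
                        globals={'torch': torch, 'a': a, 'b': b}).timeit(1000)
    print(f"{label}: {m.mean * 1e6:.2f} us per iter")
```

If the two per-iteration times again come out nearly identical, that supports the launch-overhead explanation rather than a measurement artifact.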

I would check whether you see the same behavior with a larger model or with a greater difference in input sizes.
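One way to do that is to sweep the matrix size and watch where the per-iteration time stops being flat, i.e. where compute starts to dominate launch overhead. A rough sketch (the sizes and iteration counts here are arbitrary choices, and it falls back to CPU without a GPU):

```python
import time
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def per_iter(n, iters):
    """Average seconds per n x n matmul over `iters` runs."""
    x = torch.randn(n, n, device=device)
    torch.matmul(x, x)  # warm-up: lazy init, kernel selection
    if device == 'cuda':
        torch.cuda.synchronize()
    t1 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(x, x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - t1) / iters

# Fewer iterations for the large size, which is compute-bound.
for n, iters in [(10, 1000), (100, 1000), (1000, 20)]:
    print(f"{n}x{n}: {per_iter(n, iters) * 1e6:.1f} us per iter")
```

On a GPU the small sizes should all cost roughly the same (pure launch overhead), with the time only growing once the matrices are large enough that the kernel itself takes longer than the launch.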