Wrong matmul benchmark on GPU

I wanted to run a simple matmul benchmark on the GPU. Following the PyTorch Benchmark tutorial, I wrote the following code:

import torch
import torch.utils.benchmark as benchmark

# Prepare matrices
N = 10000
A = torch.randn(N, N)
B = torch.randn(N, N)

# Send to device
device = torch.device("cuda")
A = A.to(device)
B = B.to(device)

# Benchmark
n_threads = 1
t = benchmark.Timer(
    stmt="A @ B",
    globals={"A": A, "B": B},
    num_threads=n_threads,
)
m = t.blocked_autorange(min_run_time=3)  # collect measurements for at least 3 seconds

# Print results
print(m)
print(f"Mean:  {m.mean * 1e3:6.2f} ms"
      + f" | First: {m._sorted_times[0] *1e3:6.2f} ms"
      + f" | Median: {m.median *1e3:6.2f} ms"
      + f" | Last: {m._sorted_times[-1] *1e3:6.2f} ms.")

which prints out

<torch.utils.benchmark.utils.common.Measurement object at 0x7ff2e2546250>
A @ B
  Median: 25.13 us
  IQR:    0.50 us (24.92 to 25.42)
  541 measurements, 1 runs per measurement, 1 thread
Mean:    5.56 ms | First:   0.02 ms | Median:   0.03 ms | Last:  99.97 ms.

As you can see, there is a huge spread in the runtimes: most of the measurements take only ~20–25 us, while the slowest ones take ~100 ms. I have a hard time believing that a matmul over such large matrices can finish in ~20 us: a 10000 x 10000 matmul is roughly 2e12 FLOPs, and doing that in ~20 us would require ~1e17 FLOP/s, far beyond any GPU. What is happening here? Have I done something wrong in the benchmark?
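
For comparison, this is the kind of manually synchronized timing I would check against (just a sketch using time.perf_counter() and explicit torch.cuda.synchronize() calls, not the Timer API), to see what a single matmul costs once the GPU has actually finished the work:

import time
import torch

device = torch.device("cuda")
N = 10000
A = torch.randn(N, N, device=device)
B = torch.randn(N, N, device=device)

# Warm-up, so CUDA initialization and kernel selection are not part of the timing
for _ in range(3):
    A @ B
torch.cuda.synchronize()

# Time a single matmul and wait for the GPU to finish it before stopping the clock
start = time.perf_counter()
C = A @ B
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"Synchronized matmul: {elapsed * 1e3:.2f} ms")

Is the benchmark Timer supposed to take care of this kind of synchronization for me, or is that something I need to handle myself?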