I wanted to run a simple matmul benchmark on the GPU. Following the PyTorch Benchmark tutorial, I wrote the following code:
import torch
import torch.utils.benchmark as benchmark
# Prepare matrices
N = 10000
A = torch.randn(N, N)
B = torch.randn(N, N)
# Send to device
device = torch.device("cuda")
A = A.to(device)
B = B.to(device)
# Benchmark
n_threads = 1
t = benchmark.Timer(
    stmt="A @ B",
    globals={"A": A, "B": B},
    num_threads=n_threads,
)
m = t.blocked_autorange(min_run_time=3)
# Print results
print(m)
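# m._sorted_times is a private attribute holding the individual measurement times
# in ascending order, so index 0 is the fastest run and index -1 the slowest.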
print(f"Mean: {m.mean * 1e3:6.2f} ms"
+ f" | First: {m._sorted_times[0] *1e3:6.2f} ms"
+ f" | Median: {m.median *1e3:6.2f} ms"
+ f" | Last: {m._sorted_times[-1] *1e3:6.2f} ms.")
which prints out
<torch.utils.benchmark.utils.common.Measurement object at 0x7ff2e2546250>
A @ B
Median: 25.13 us
IQR: 0.50 us (24.92 to 25.42)
541 measurements, 1 runs per measurement, 1 thread
Mean: 5.56 ms | First: 0.02 ms | Median: 0.03 ms | Last: 99.97 ms.
As you can see, there is a huge gap in the measured runtimes: the first few hundred measurements (in sorted order) take only ~20 us, while the slowest ones take ~100 ms. I have a hard time believing that a matmul over such large matrices can run in ~20 us. What is happening here? Have I done something wrong in the benchmark?
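For reference, here is a sanity check I was going to compare against: a minimal sketch that times a single matmul manually and calls torch.cuda.synchronize() before stopping the clock, which I assume is the right way to make the timing wait until the kernel has actually finished. It reuses A and B from above.
import time
# Sanity check (sketch): time one matmul with explicit synchronization,
# so the measurement includes the full kernel execution time.
torch.cuda.synchronize()              # wait for any pending GPU work
start = time.perf_counter()
C = A @ B                             # same matrices as in the benchmark above
torch.cuda.synchronize()              # wait for the matmul kernel to finish
elapsed = time.perf_counter() - start
print(f"Single synchronized matmul: {elapsed * 1e3:.2f} ms")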