[Question] How to get stable torch.cuda.Event timings for reliable benchmarking?
1. The Goal & Problem
I am trying to reliably benchmark a DSA (DeepSeek Sparse Attention) kernel to understand its performance.
However, torch.cuda.Event timings show significant variance (~10-30%) even after multiple warmups, averaging, and other standard practices. This noise makes it difficult to compare optimizations or determine performance bottlenecks accurately.
2. Minimal Example (GEMM)
The actual DSA code is integrated within vLLM. However, this simpler GEMM benchmark demonstrates the same timing instability.
```python
import torch
import statistics


def benchmark_kernel(num_iterations=10):
    device = torch.device("cuda")
    x = torch.randn(10000, 6000, device=device)
    y = torch.randn(6000, 2000, device=device)

    # Warmup
    for _ in range(5):
        _ = torch.matmul(x, y)
    torch.cuda.synchronize()

    # Measurement
    timings = []
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(num_iterations):
        start_event.record()
        # Simulate a kernel with multiple operations
        z = torch.matmul(x, y)
        z = torch.matmul(x, y)
        z = torch.matmul(x, y)
        end_event.record()
        torch.cuda.synchronize()
        timings.append(start_event.elapsed_time(end_event))

    print(f"Mean: {statistics.mean(timings):.3f} ms")
    print(f"Min: {min(timings):.3f} ms, Max: {max(timings):.3f} ms")
    return timings


if __name__ == "__main__":
    benchmark_kernel()
```
GEMM Results:

```
Mean: 15.373 ms
Min: 15.352 ms, Max: 15.423 ms
```
(Note: While this specific GEMM example is relatively stable, the variance is much higher in my actual, more complex sparse attention kernel as shown below.)
3. Real-World Variance (Sparse Attention Kernel)
When measuring my actual target—the DSA forward module in vLLM—the variance is much more pronounced.
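The events are recorded the same way as in the GEMM example, just wrapped around the DSA forward call. A simplified stand-in is below; the real integration lives inside vLLM, so `module` and `inputs` here are placeholders:

```python
import torch


def time_forward(module, inputs, iters=10):
    # `module` / `inputs` are placeholders for the vLLM DSA module and its
    # prepared inputs; the real call happens inside vLLM's forward path.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    timings = []
    for _ in range(iters):
        start.record()
        module(*inputs)
        end.record()
        torch.cuda.synchronize()
        timings.append(start.elapsed_time(end))
    return timings
```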
Based on 10 measurements (context length 512):
- Min: 1.601 ms
- Max: 2.295 ms
- Mean: 1.831 ms
This ~30% spread between min and max makes it impossible to reliably calculate the indexer/DSA time ratio, which is my primary goal.
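To make the stakes concrete: if the indexer portion of the forward pass shows a similar spread, the ratio can land almost anywhere in a wide band. The indexer values in this snippet are hypothetical, purely for illustration; the DSA totals are the measured min/max from above:

```python
# Illustration only: indexer_min/indexer_max are made-up numbers,
# dsa_min/dsa_max are the measured values quoted above.
indexer_min, indexer_max = 0.5, 0.8   # ms (hypothetical)
dsa_min, dsa_max = 1.601, 2.295       # ms (measured)

print(f"lowest possible ratio:  {indexer_min / dsa_max:.2f}")  # ~0.22
print(f"highest possible ratio: {indexer_max / dsa_min:.2f}")  # ~0.50
```

In that illustrative case the ratio estimate varies by more than 2x depending on which iterations happen to be compared.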
4. Environment
- GPU: NVIDIA H100 80GB HBM3
- Software: PyTorch 2.8.0+cu128, CUDA 12.8, Driver 550.54.15, Ubuntu 22.04.5 LTS
5. What I’ve Tried
- Warmup iterations (5)
- Averaging over multiple runs (10 iterations)
- Outlier removal, trimming the min/max values (a small helper for this is sketched below)
- torch.cuda.synchronize() after each operation
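The other mitigations are already visible in the GEMM script above; the trimming step is just the following (sketch, with `timings` being the per-iteration CUDA-event timings):

```python
import statistics


def trimmed_mean(timings, trim=1):
    # Drop the `trim` smallest and largest samples before averaging.
    kept = sorted(timings)[trim:len(timings) - trim]
    return statistics.mean(kept)
```

None of this brings the DSA spread below the numbers reported in section 3.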
6. Questions
- Is this level of timing variance expected for complex kernels?
- Are there better, more stable methods or “best practices” for benchmarking with PyTorch on CUDA beyond what I’ve tried?
- Could this be related to GPU power states, scheduler jitter, or something else I can control?
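On the last point, I have not ruled out clock or thermal throttling. My plan is to log the SM clock and temperature around the slow iterations with something like the snippet below (assumes the nvidia-ml-py / pynvml package; it is only a rough diagnostic, not something I have validated):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0


def log_gpu_state(tag):
    # Current SM clock (MHz) and temperature (C), to correlate with slow iterations.
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"[{tag}] SM clock: {sm_clock} MHz, temperature: {temp} C")
```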
Any advice on how to achieve more stable and reproducible timings would be greatly appreciated. If this is a known issue, pointers to relevant documentation would also be helpful.