There are many options for CPU-only benchmarking: I’ve used timeit as well as profilers.
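For the CPU-only case, a minimal timeit sketch might look like this (the tensor sizes and iteration count here are illustrative, not from the original post):

```python
import timeit

# Benchmark an elementwise add on CPU tensors; setup code runs once, untimed
setup = "import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)"
total = timeit.timeit("x + y", setup=setup, number=100)
print(f"{total / 100 * 1e6:.1f} us per iteration")
```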
CUDA is asynchronous, so you’ll need some tools to measure time correctly. CUDA events are good for this: if you’re timing “add” on two CUDA tensors, you should sandwich the call between CUDA events:
import torch

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
z = x + y
end.record()
# Waits for everything to finish running before reading the timer
torch.cuda.synchronize()
print(start.elapsed_time(end))  # milliseconds
The PyTorch autograd profiler is a good way to get timing information as well: https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20profiler#torch.autograd.profiler.profile. It uses the CUDA event API under the hood and is easy to use:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    ...  # do something
print(prof)
It’ll tell you the CPU and CUDA timings of your functions.
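As a concrete sketch, you can also aggregate the profiler output per operator with `key_averages()`. The example below runs on CPU for simplicity (the workload and `row_limit` are my choices, not from the original post); pass `use_cuda=True` on a GPU machine to get CUDA timings too:

```python
import torch
from torch.autograd import profiler

x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

with profiler.profile() as prof:
    z = torch.matmul(x, y)

# Aggregate stats per operator, sorted by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```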