How to measure time in PyTorch


(Талгат) #1

I have seen lots of ways to measure time in PyTorch. But what is the recommended way to do it now (both for CPU and CUDA)?
Should I clear the memory cache if I use timeit?
And is it possible to get accurate results if I’m computing on a cluster? And is there a way to make these results reproducible?
And what is better: timeit or profiler?


#2

There are many options for CPU-only benchmarking: I’ve used timeit as well as profilers.
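For the CPU-only case, here is a minimal sketch of what a timeit measurement could look like. The tensor sizes and the repeat/number values are arbitrary choices for illustration, not recommendations:

```python
import timeit

import torch

# Arbitrary example inputs on the CPU
x = torch.rand(1000, 1000)
y = torch.rand(1000, 1000)

number = 100  # calls per run (illustrative value)

# repeat() returns one total time per run; each run calls the function `number` times
totals = timeit.repeat(lambda: x + y, repeat=5, number=number)

# Report the fastest run, converted to seconds per call
best_per_call = min(totals) / number
print(f"best of {len(totals)} runs: {best_per_call * 1e6:.1f} us per add")
```

Taking the minimum over several runs is one common way to reduce noise from other processes on the machine; the mean or median are reasonable alternatives.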

CUDA is asynchronous, so a plain wall-clock timer around a call mostly measures the kernel launch, and you will need some tools to measure time. CUDA events are good for this: if you’re timing “add” on two CUDA tensors, you should sandwich the call between CUDA events:

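import torch

# Example inputs (sizes are arbitrary); both tensors live on the GPU
x = torch.rand(1000, 1000, device="cuda")
y = torch.rand(1000, 1000, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
z = x + y
end.record()

# Waits for everything to finish running
torch.cuda.synchronize()

# Elapsed time between the two events, in milliseconds
print(start.elapsed_time(end))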

The PyTorch autograd profiler is a good way to get timing information as well: https://pytorch.org/docs/stable/autograd.html?highlight=autograd%20profiler#torch.autograd.profiler.profile. It uses the CUDA event API under the hood and is easy to use:

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    z = x + y  # replace with the code you want to profile
print(prof)

It’ll tell you the CPU and CUDA timings of your functions.
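If the full printout is too long, the results can also be aggregated per operator. The following is a small sketch that runs on the CPU only so it works without a GPU; passing use_cuda=True as in the snippet above should add the CUDA timings. The tensor sizes, the sort key, and the row limit are illustrative choices:

```python
import torch

# Arbitrary example inputs on the CPU
x = torch.rand(1000, 1000)
y = torch.rand(1000, 1000)

with torch.autograd.profiler.profile() as prof:
    z = x + y

# Group the recorded events by operator name and format them as a text table
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```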


(Талгат) #3

Thank you!
But does the profiler perform synchronization during time measurement?


#4

Thank you for the examples!

How important is it to perform some dry runs before time measurements?


(Mamy Ratsimbazafy) #5

One is usually enough. The main reason for a dry run is to put your CPU and GPU into their maximum performance state. This is especially useful on laptops, where the CPU is typically in a power-saving state by default.

CPUs and GPUs are very quick to switch to the maximum performance state, so just doing a 3000x3000 matrix multiplication before the actual benchmark should be enough, and it takes a couple of seconds at most.
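As a rough sketch of that advice, the helper below does one untimed matrix multiplication as a dry run and then times the real workload. The function name mean_matmul_seconds and its parameters are hypothetical, made up for this illustration; it falls back to the CPU when no GPU is available:

```python
import time

import torch


def mean_matmul_seconds(n, iters=10, warmup_n=3000):
    """Hypothetical helper: mean seconds per n x n matmul after one dry run."""
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Dry run: one untimed warmup_n x warmup_n matmul to leave power-saving states
    w = torch.rand(warmup_n, warmup_n, device=device)
    w @ w

    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)

    # Make sure the dry run has finished on the GPU before starting the clock
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    # CUDA is asynchronous, so wait for the queued work before stopping the clock
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```

For example, mean_matmul_seconds(1000) would return the average time of ten 1000x1000 multiplications, measured after a 3000x3000 dry run.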

Caveat: on some CPUs, an AVX2 workload will downclock the CPU frequency (and AVX512 is worse).