How to measure time in PyTorch

I have seen lots of ways to measure time in PyTorch. But what is the most proper way to do it now (both for cpu and cuda)?
Should I clear the memory cache if I use timeit?
And is it possible to get accurate results if I’m computing on a cluster? And is there a way to make these results reproducible?
And what is better: timeit or profiler?


There are many things you can do for CPU-only benchmarking: I’ve used timeit as well as profilers.
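For example, a CPU-only micro-benchmark with the stdlib timeit module — a minimal sketch where the timed statement is just a placeholder; substitute your own torch op for a real benchmark:

```python
import timeit

# time a CPU-bound statement 1000 times; swap in a torch op
# (e.g. "z = x + y" plus a setup string) for a real benchmark
elapsed = timeit.timeit("sum(range(1000))", number=1000)
print(f"total: {elapsed:.4f} s, per run: {elapsed / 1000 * 1e6:.1f} µs")
```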

CUDA is asynchronous, so you will need some tools to measure time. CUDA events are good for this. If you’re timing “add” on two CUDA tensors, you should sandwich the call between CUDA events:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
z = x + y
end.record()

# Waits for everything to finish running
torch.cuda.synchronize()

print(start.elapsed_time(end))  # elapsed time in milliseconds

The PyTorch autograd profiler is a good way to get timing information as well. It uses the CUDA event API under the hood and is easy to use:

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    z = x + y  # do something

It’ll tell you the CPU and CUDA timings of your functions.
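A runnable sketch of this (CPU-only here so it works without a GPU; the workload is a placeholder matmul):

```python
import torch

x = torch.randn(100, 100)
with torch.autograd.profiler.profile() as prof:  # use_cuda=True to also collect CUDA timings
    y = x @ x

# per-op summary, sorted by total CPU time
table = prof.key_averages().table(sort_by="cpu_time_total")
print(table)
```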


Thank you!
But does the profiler perform synchronization during time measurement?


Thank you for the examples!

How important is it to perform some dry runs before time measurements?

One is usually enough; the main reason for a dry run is to put your CPU and GPU into their maximum performance state. This is especially useful for laptops, as laptop CPUs all default to power saving.

CPU and GPU are very quick to switch to the maximum performance state, so just doing a 3000x3000 matrix multiplication before the actual benchmark should be enough, and it takes a couple of seconds at most.
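A minimal sketch of such a dry run, assuming the 3000x3000 matmul suggested above (the helper name warm_up is mine, not a PyTorch API):

```python
import torch

def warm_up(n=3000):
    # one large matmul to push the CPU/GPU out of its power-saving state
    # before the real benchmark; uses the GPU when one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    return (a @ b).sum().item()  # .item() forces the work to complete

warm_up()
```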

Caveat: on some CPUs, an AVX2 workload will downclock the CPU frequency (and AVX512 is worse).


Let me apologize for digging up this topic, but there is one thing I want to make sure I understand correctly. Let’s say I have a training loop that performs both CPU-bound and GPU-bound operations in sequence. It:

  1. Loads a batch of data from the disk
  2. Converts the data to PyTorch Tensors on a selected device
  3. Does the necessary calculations (forward, loss, backward)

The script runs slower than I expected, and I wanted to time the respective blocks to see what each of these major actions contributes to the time of processing one batch. My first approach was to try Python’s time module, but it fooled me: I ended up spending a whole day optimizing a function that should take just a fraction of the total execution time, and when I then used the method described in @richard’s answer, it turned out that this small function really is meaningless when it comes to the total execution time. So my question now is: is the above-described method device agnostic? Can I also use CUDA events to time parts of the code that are fully CPU bound, like loading the data from disk? Will the time printed by the below snippet give me an accurate result?

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# some CPU-bound operations, i.e. loading data...
end.record()

# Waits for everything to finish running
torch.cuda.synchronize()

print(start.elapsed_time(end))  # elapsed time in milliseconds


I would be grateful if you could clear this up for me; I just want to avoid a misunderstanding.


Short answer: it will work.

Long answer:
CUDA events are markers that are recorded when the GPU reaches them while executing the stream.
The CPU just dispatches work asynchronously to the GPU.
So when the CPU hits start.record(), it sends the event to the GPU, and the GPU records the time when it reaches it.
Whatever the code after this may be: if it is CPU bound, the CPU executes it; if it is GPU bound, the CPU queues it in the CUDA stream and keeps moving. When the CPU hits end.record(), it again queues the event in the stream, and when the GPU reaches it, it records the time.
The magic comes from torch.cuda.synchronize(), which is a blocking call that makes sure all CUDA commands dispatched before it are done. In the case of CPU-bound code there are only two calls, start.record() and end.record(), so they will be done.


I am a little confused by the ordering of operations.
Shouldn’t it be:

start.record()
z = x + y

# Waits for everything to finish running
torch.cuda.synchronize()
end.record()

I have actually tried it both before and after end.record(), and it gives different results, which I did not expect…
Can someone explain why there is a difference?


@theevann so which one is the right way? Thanks.


We have a question: how do we use the API torch.autograd.profiler.profile to compute the total time?


@theevann @cognitiverobot
Short answer: You want synchronize() after end.record() as in the original post above.

Long answer:
The way this works is the function call and the events are added to a CUDA stream object, in order. Each kernel and event in a stream is executed in order, synchronously. The only asynchronous part (for a single stream) is with respect to the CPU. When you call torch.cuda.synchronize() it will wait for the GPU to complete all events in the (default) stream, before returning control to the CPU. So you want everything, including your end event, before the synchronization call. You can read more about it in a nice NVIDIA blog post here.
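To make that ordering concrete, here is a hypothetical helper (the name time_op and the CPU fallback are mine, not a PyTorch API) that queues end.record() before the synchronize, and falls back to a wall-clock timer on CPU-only machines:

```python
import time
import torch

def time_op(fn, *args):
    """Hypothetical helper (not a PyTorch API): time fn(*args), in ms.

    Uses CUDA events when a GPU is present; note end.record() is issued
    BEFORE torch.cuda.synchronize(), so the end event is already in the
    stream when we wait for it. Falls back to a wall-clock timer on CPU.
    """
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn(*args)
        end.record()                 # queue the end event first...
        torch.cuda.synchronize()     # ...then wait for the stream to drain
        return out, start.elapsed_time(end)
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0

z, ms = time_op(torch.add, torch.ones(2), torch.ones(2))
print(f"{ms:.3f} ms")
```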


Does this answer apply when running on multiple GPUs ?
If I want to measure the time for model inference on multiple GPUs (4 Tesla), will CUDA events help measure the overall GPU execution time ?

I’ve tried on colab but find

t0 = time.time()
outputs = net(x)
t1 = time.time()

gives a more accurate measurement…


Is this time based on ms or s?
