It isn’t. While there are more refined measures, there isn’t anything wrong with plain timing. Apparently there are lots of people doing it wrong (both beginners and people with considerable experience), and then there are inaccurate representations of what exactly is wrong (“can’t use time.time”; Edit: actually it is true that you should not use it, use time.perf_counter() instead!). The main things to get right are warm-up and synchronization.
The thing is that GPU work is launched asynchronously, so if you use the GPU, unless you call torch.cuda.synchronize() before taking the time (both at the start and at the finish), you don’t know what has actually been executed before and after you take the time.
I invariably use the following pattern:
def do_stuff():
    for _ in range(100):  # or 1000 or whatever, depending on how long it takes
        do_my_computation()
    torch.cuda.synchronize()

do_stuff()  # warm-up call, also synchronizes before the timing starts
%timeit do_stuff()
Of course, you need to divide the measured time by the size of the loop. I usually aim for something in the msec range or so (e.g. if %timeit reports 50 ms per loop with range(100), each call took roughly 0.5 ms).
What this does:
Running the operator (do_my_computation) multiple times between syncs reduces the influence of the synchronization (which itself takes time) on the measurement.
Calling do_stuff() once before the timing does two things:
Warm-up (e.g. some things compile kernels on the fly when they are called for the first time, etc.)
Synchronize before the timing starts.
Timing do_stuff() (rather than a single call) ensures that synchronization happens after each run (and thus implicitly before the next).
You can do essentially the same thing with time.perf_counter() (not time.time(), see below) before and after the call that %timeit times here, except that %timeit will actually call do_stuff() several times and do some stats to help you along. There is also the timeit module, which is similar, but you need to adjust the number of runs manually to the duration of your computation.
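For example, with the timeit module it might look roughly like this (a sketch; do_stuff is the function from above and the number of runs is just an illustrative choice):

import timeit

# do_stuff (from above) loops over do_my_computation and synchronizes at the end,
# so each timed run finishes at a well-defined point.
n_runs = 10  # adjust manually so the total run takes a reasonable amount of time
total = timeit.timeit("do_stuff()", globals=globals(), number=n_runs)
print(f"{total / n_runs * 1e3:.2f} ms per do_stuff() call")

(timeit uses time.perf_counter() by default, so the clock is fine here.)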
That said, the profiler gives you more detailed information with very little effort.
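For instance, something along these lines gives you a per-operator breakdown (a rough sketch using torch.profiler; the exact activities and options you want depend on your use case, and do_my_computation is the same placeholder as above):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        do_my_computation()
    torch.cuda.synchronize()

# per-operator summary, sorted by time spent on the GPU
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))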
Ah, wait. The other thing you should do is use time.perf_counter() instead of time.time(). This is because time.time() isn’t guaranteed to give valid differences (the system clock can be adjusted between the two readings); you need a monotonic clock for that, which time.perf_counter() provides.
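Put together, the manual version of the pattern above looks roughly like this (a sketch; do_my_computation again stands in for whatever you want to measure):

import time
import torch

do_my_computation()          # warm-up (kernel compilation, caching allocator, ...)
torch.cuda.synchronize()     # make sure nothing is still running before we start

start = time.perf_counter()  # monotonic, unlike time.time()
for _ in range(100):
    do_my_computation()
torch.cuda.synchronize()     # wait for the GPU to finish before stopping the clock
elapsed = time.perf_counter() - start

print(f"{elapsed / 100 * 1e3:.2f} ms per call")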
Break down the entire thing into the bits you want to measure (i.e. do your parts reconcile to the total? If not, where are the overlaps or gaps between the parts?).
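A sketch of what I mean (part_a and part_b are hypothetical stages of your computation; the point is just to check whether the parts add up to the total):

import time
import torch

def timed(fn, n=100):
    # time n synchronized runs of fn and return seconds per run
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n

t_total = timed(lambda: (part_a(), part_b()))
t_a = timed(part_a)
t_b = timed(part_b)
# if t_a + t_b is far from t_total, a part is missing or the parts overlap
print(f"total {t_total * 1e3:.2f} ms vs sum of parts {(t_a + t_b) * 1e3:.2f} ms")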
The links don’t seem to show the actual measurement you inserted.
I must admit that I don’t know. One thing to exclude would be stochastic variation (e.g. %timeit gives you a standard deviation, so you can imagine error bars for the measurement), but I would not expect a 30ms -> 60ms jump from that.
The other part is that you need a really stable environment to get reliable benchmarking numbers; maybe something we did here changed something w.r.t. the other things going on on the machine.
(But maybe @ptrblck knows something.)
Fun anecdote: A long time ago, I briefly enabled remote access to the GPU I used for benchmarking for one of my fellow PyTorch devs, because I somehow had a much more stable timing environment than they did.