Why time.time() in python is inaccurte?

I think it is an elementary question about programming with GPU.

First, i tried to use time.time() in python module, to measure the operation time of some modules in NNs.
such as

def forward(self, x):
  end = time.time()
  output1 = self.layer1(x)
  time_output1 = time.time()
  output2 = self.layer2(output1)
  time_output2 = time.time()
  print(time_output1 - end, time_output2-end)

However, i found that the time information is inaccurate, and i have to use below link:

Although i measure the correct time by the methods in the link,
i want to know why profiling with time.time() gives inaccurate results.

an example: Are there results with other normalizations? · Issue #66 · facebookresearch/moco · GitHub

Thank you !

It isn’t. While there are more refined measures, there isn’t anything wrong with plain timing. Apparently there first are lots of people doing it wrong (both beginners and people with considerable experience) and then inaccurate representations of what exactly is wrong (“can’t use time.time” Edit: actually it is true you should not use it but time.perf_counter(!)): The main things to get right is warm-up and synchronization.

The thing is that if you use the GPU, unless you call torch.cuda.synchronize() before taking the time (for both start and finish), you don’t know what has been executed before and after the time taking.
I invariably use the following pattern:

def do_stuff():
   for _ in range(100): # or 1000 or whatever, depending on how long it takes
      do_my_computation()
   torch.cuda.synchronize()

do_stuff()
%timeit do_it()

Of course, you need to divide the time by whatever size of the loop you have. I usually aim to have something in the msec range or so.

What this does:
This does run the operator (do_my_computation) multiple times between syncs, this would reduce the influence of the synchronization (which takes time) on the measurement.

Calling do_stuff() before the timing does:

  • Warm-up (e.g. some things compile kernels on the fly when called for the first time etc.)
  • Synchronize before starting the timing

Timing with do_stuff() ensures that synchronization happens after each run (and thus implicitly before the next).

You can do essentially the same thing with time.time time.perf_counter() before and after what is %timeit here, except that timeit will actually call do_stuff several times and do some stats to help you along. There also is the timeit module which is similar but you need to adjust the number of runs manually to the duration of your computation.

That said, the profiler gives you more detailed information with very little effort.

Best regards

Thomas

3 Likes

Hello tom. first of all, thank you for fast and detailed reply !

After i read, i tried to use time.time() and torch.cuda.synchronize() to profile the elapsed times.
However, the results are still inaccurate.

For example of https://github.com/facebookresearch/moco/issues/66,

i) time.time() w/o torch.cuda.synchronize()

  • shuffle time: 0.5993 s
  • inf_time: 0.1185 s

ii) use torch.cuda.Event & torch.cuda.synchronize()

  • shuffle time: 2.72 ms
  • inf_time: 59.88 ms

iii) use time.time() w/ torch.cuda.synchronize()

  • shuffle time: 0.0649 s
  • inf_time: 0.0587 s

Although i use torch.cuda.synchronize(),
the shuffle time is still over-estimated.

1 Like

Ah, wait. The other thing you should do is use time.perf_counter() instead of time.time(). This is because time.time() isn’t guaranteed to actually give valid differences, but you need to use a monotonic clock for that.

1 Like

Thanks Tom.
I checked both time.perf_counter() and time.process_time() with torch.cuda.synchronize(), and got similar results to time.time()

iv) use time.perf_counter() w/ torch.cuda.synchronize()

  • shuffle time: 0.0650 s
  • inf time: 0.0587 s

v) use time.process_time() w/ torch.cuda.synchronize()

  • shuffle time: 0.0879 s
  • inf time: 0.0584 s

When comparing all the results, the inference time is consistent,
but the shuffle time is inconsistent by the profiling method.

the shuffle time is shuffleBN is Moco


which gathers all samples, shuffles the index, and reallocates mini-batch to each GPU.

I cannot infer a reason why only the record time of this shuffle operation has varied by measuring method.

That might be a lot clearer if you

  • specify what exactly you want to measure,
  • time these bits in isolation,
  • break down the entire thing into the bits you want to measure (i.e. do your parts reconcile to the total? if not, where are overlaps or gaps in the parts).

The links don’t seem to show the actual measurement you inserted.

In the forward pass of moco, i am measuring the shuffling time, that is

    def forward(self, im_q, im_k):
        (....)
        # compute key features
        with torch.no_grad():  # no gradient to keys
            self._momentum_update_key_encoder()  # update the key encoder
            
            start_time = time.time()
            # shuffle for making use of BN
            im_k, idx_unshuffle = self._batch_shuffle_ddp(im_k)
            torch.cuda.synchronize()
            end_time = time.time()
            shuffle_time = end_time - start_time
            (...)

So the immediate takeaway from the above discussion is

  1. replace time.time with time.perf_counter()
  2. have a torch.cuda.synchronize() before taking the the start_time,
  3. maybe don’t take the first batch (of a given size)
3 Likes

time.time()

i) time.time() w/o torch.cuda.synchronize()

  • shuffle time: 0.5993 s
  • inf_time: 0.1185 s

ii) time.time() w/o torch.cuda.synchronize() both before and after an operation

  • shuffle time: 0.0018 s
  • inf_time: 0.031 s

time.perf_counter()

i) use time.perf_counter() w/ torch.cuda.synchronize()

  • shuffle time: 0.0650 s
  • inf time: 0.0587 s

ii) use time.perf_counter() w/ torch.cuda.synchronize() both before and after an operation

  • shuffle time: 0.0021 s
  • inf time: 0.0309 s

time.process_time()

i) use time.process_time() w/ torch.cuda.synchronize()

  • shuffle time: 0.0879 s
  • inf time: 0.0584 s

ii) use time.process_time() w/ torch.cuda.synchronize() both before and after an operation

  • shuffle time: 0.001879 s
  • inf time: 0.03107 s

torch.cuda.Event

i) use torch.cuda.Event & torch.cuda.synchronize()

  • shuffle time: 2.72 ms
  • inf_time: 59.88 ms

Conclusions

  • When time profiling, we can use both time module in python and torch.cuda.Event.
  • What we remember is, we must use torch.cuda.synchronize() right before and after the operations to be profiled.
  • Measurements of time and cuda.Event is some what different, but the ratio is consistent.
3 Likes

In the last,
can i ask some reasons why the time module and cuda.Event measure different elapsed times?
i am using V100x4 and DistributedDataParallel.

I must admit that I don’t know. One thing to exclude would be stochastic variation (e.g. %timeit gives you a standard deviation, so you can imagine error bars for the measurement), but I would not see 30ms->60ms doing that.
The other part is that you need a really stable environment to get reliable benchmarking, maybe something we did here changed something w.r.t. other things going on.
(But maybe @ptrblck knows something.)

Fun anecdote: A long time ago, I briefly enabled remote access to the GPU I used for benchmarking for one of fellow PyTorch devs because I somehow had a much more stable timing environment then them.

Best regards

Thomas

2 Likes