It isn’t. While there are more refined measures, there isn’t anything wrong with plain timing. Apparently there are lots of people doing it wrong (both beginners and people with considerable experience), and then there are inaccurate representations of what exactly is wrong (“can’t use time.time”; Edit: actually it is true that you should not use it, use time.perf_counter() instead!). The main things to get right are warm-up and synchronization.
The thing is that GPU work is launched asynchronously, so if you use the GPU, unless you call torch.cuda.synchronize() before taking the time (both at the start and at the finish), you don’t know what has actually been executed before and after you take the time.
I invariably use the following pattern:
def do_stuff():
    for _ in range(100):  # or 1000 or whatever, depending on how long it takes
        do_my_computation()
    torch.cuda.synchronize()

do_stuff()  # warm-up call, also synchronizes before the timing starts
%timeit do_stuff()
Of course, you need to divide the measured time by the size of the loop. I usually aim for something in the msec range or so (e.g. if %timeit reports 50 ms per loop with range(100), each call took roughly 0.5 ms).
What this does:
Running the operator (do_my_computation) multiple times between syncs reduces the influence of the synchronization (which itself takes time) on the measurement.
Calling do_stuff() once before the timing does two things:
Warm-up (e.g. some things compile kernels on the fly when they are called for the first time, etc.)
Synchronize before the timing starts.
Timing do_stuff() (rather than a single call) ensures that synchronization happens after each run (and thus implicitly before the next).
You can do essentially the same thing with time.perf_counter() (not time.time(), see below) before and after the call that %timeit times here, except that %timeit will actually call do_stuff() several times and do some stats to help you along. There is also the timeit module, which is similar, but you need to adjust the number of runs manually to the duration of your computation.
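For example, with the timeit module it might look roughly like this (a sketch; do_stuff is the function from above and the number of runs is just an illustrative choice):

import timeit

# do_stuff (from above) loops over do_my_computation and synchronizes at the end,
# so each timed run finishes at a well-defined point.
n_runs = 10  # adjust manually so the total run takes a reasonable amount of time
total = timeit.timeit("do_stuff()", globals=globals(), number=n_runs)
print(f"{total / n_runs * 1e3:.2f} ms per do_stuff() call")

(timeit uses time.perf_counter() by default, so the clock is fine here.)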
That said, the profiler gives you more detailed information with very little effort.
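For instance, something along these lines gives you a per-operator breakdown (a rough sketch using torch.profiler; the exact activities and options you want depend on your use case, and do_my_computation is the same placeholder as above):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        do_my_computation()
    torch.cuda.synchronize()

# per-operator summary, sorted by time spent on the GPU
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))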
Ah, wait. The other thing you should do is use time.perf_counter() instead of time.time(). This is because time.time() isn’t guaranteed to give valid differences (the system clock can be adjusted between the two readings); you need a monotonic clock for that, which time.perf_counter() provides.
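Put together, the manual version of the pattern above looks roughly like this (a sketch; do_my_computation again stands in for whatever you want to measure):

import time
import torch

do_my_computation()          # warm-up (kernel compilation, caching allocator, ...)
torch.cuda.synchronize()     # make sure nothing is still running before we start

start = time.perf_counter()  # monotonic, unlike time.time()
for _ in range(100):
    do_my_computation()
torch.cuda.synchronize()     # wait for the GPU to finish before stopping the clock
elapsed = time.perf_counter() - start

print(f"{elapsed / 100 * 1e3:.2f} ms per call")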
Break down the entire thing into the bits you want to measure (i.e. do your parts reconcile to the total? If not, where are the overlaps or gaps between the parts?).
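A sketch of what I mean (part_a and part_b are hypothetical stages of your computation; the point is just to check whether the parts add up to the total):

import time
import torch

def timed(fn, n=100):
    # time n synchronized runs of fn and return seconds per run
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n

t_total = timed(lambda: (part_a(), part_b()))
t_a = timed(part_a)
t_b = timed(part_b)
# if t_a + t_b is far from t_total, a part is missing or the parts overlap
print(f"total {t_total * 1e3:.2f} ms vs sum of parts {(t_a + t_b) * 1e3:.2f} ms")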
The links don’t seem to show the actual measurement you inserted.
I must admit that I don’t know. One thing to exclude would be stochastic variation (e.g. %timeit gives you a standard deviation, so you can imagine error bars for the measurement), but I would not expect a 30ms -> 60ms jump from that.
The other part is that you need a really stable environment to get reliable benchmarking numbers; maybe something we did here changed something w.r.t. the other things going on on the machine.
(But maybe @ptrblck knows something.)
Fun anecdote: A long time ago, I briefly enabled remote access to the GPU I used for benchmarking for one of my fellow PyTorch devs, because I somehow had a much more stable timing environment than they did.