Timing a model on the GPU

Hi, I want to know: when I measure the time it takes for an image to go through a model on the GPU, which of the following is correct?

1) start = time.time()
   result = vgg_gpu_model(image)
   end = time.time()

2) torch.cuda.synchronize()
   start = time.time()
   result = vgg_gpu_model(image)
   torch.cuda.synchronize()
   end = time.time()

I find that the timing results of method 1 and method 2 are very different, but the outputs are the same. What's the difference between them?
Thanks for your attention.


  2 is the correct way to do this.

I wonder, what is the difference between them? With method 1, I also seem to get the correct result.

The main problem is that all CUDA operations are asynchronous.
This means that if you don't explicitly synchronize, the time you measure could actually correspond to the time it took to perform other operations that you asked for before you started your timing.
What 2 does that 1 does not is make sure that nothing is still running on the GPU before you start your timer, and then make sure that all the operations you asked for are finished before you stop it.
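
The pattern described above can be wrapped in a small helper. This is a minimal sketch; `timed_forward` and the `Linear` stand-in for `vgg_gpu_model` are hypothetical names, and the CUDA calls are guarded so the sketch also runs on a CPU-only machine:

```python
import time
import torch

def timed_forward(model, x):
    # Drain any previously queued GPU work so it doesn't leak into our measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = model(x)
    # Wait for the queued kernels to actually finish before stopping the clock.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.time() - start

# Hypothetical small model standing in for vgg_gpu_model.
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

out, elapsed = timed_forward(model, x)
print(f"forward took {elapsed * 1000:.2f} ms")
```

Without the second synchronize, the timer would stop as soon as the kernels were queued, not when they finished.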

Thank you for your reply.
Actually, when I use method 1 it takes 5 ms, while with method 2 it takes 300 ms. I wonder what the extra 295 ms is doing? I would think that after 5 ms the operation should already be done.

No, a VGG takes on the order of hundreds of ms to run a forward pass.
5 ms is how long it takes to run the Python code, queue up all the CUDA operations, and start the execution.

However, if I do not use torch.cuda.synchronize(), my code finishes after 5 ms and returns the result. Doesn't this prove that I have already got what I want?

Here is my code:

import time
import torch

def do_it(X, kh, kw):
  if not X.is_cuda:
    raise Exception("X is not on the GPU!")
  if not len(X.shape) == 3:
    raise Exception("The shape of X is wrong!")
  C, H, W = X.shape
  # Extract every kh x kw patch, one per spatial position.
  X_unfold = X.unfold(1, kh, 1).unfold(2, kw, 1)
  X_unfold = X_unfold.permute(1, 2, 0, 3, 4)
  shape_ = X_unfold.shape
  sample_num = shape_[0] * shape_[1]               # number of patches
  sample_dim = shape_[2] * shape_[3] * shape_[4]   # flattened patch size
  Y = X_unfold.reshape(sample_num, sample_dim)
  K = torch.mm(Y, Y.t())  # Gram matrix of the patches
  return Y, K

if __name__ == '__main__':
  X = torch.rand(512, 60, 60).cuda()
  Y, K = do_it(X, 15, 15)  # warm-up run
  start = time.time()
  for i in range(10):
    _, K = do_it(X, 15, 15)
  end = time.time()

It returns, but those tensors are not readily available. This is all done transparently by PyTorch, but if, for example, you try to print one of the values of the output, it will take the 300 ms, because to get the value you need to wait for it to be computed.
What PyTorch does is pass tensors around without knowing their content, and you only wait for the content when you have no other choice, which is mostly when the value is required on the CPU side.
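
This lazy behavior can be made visible by timing the launch and the fetch separately. A minimal sketch (on a CPU-only machine both measurements will include the compute, since CPU ops are synchronous; on a GPU the first time should be much smaller than the second):

```python
import time
import torch

x = torch.randn(2048, 2048)
if torch.cuda.is_available():
    x = x.cuda()

start = time.time()
y = x @ x  # on a GPU this only queues the matmul and returns immediately
queued = time.time() - start

start = time.time()
val = y[0, 0].item()  # .item() needs the value on the CPU, so it waits for the matmul
fetched = time.time() - start

print(f"queue: {queued * 1000:.2f} ms, fetch: {fetched * 1000:.2f} ms")
```

The `.item()` call (like `print`, `.cpu()`, or `.numpy()`) is exactly the "no other choice" moment where the hidden wait happens.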

OK, thank you, I see it now.