Time consumption of a GPU model

Hi, I want to know: when I measure the time it takes to run an image through the GPU model, which of the following is correct?

1.
start = time.time()
result = vgg_gpu_model(image)
end = time.time()

2.
torch.cuda.synchronize()
start = time.time()
result = vgg_gpu_model(image)
torch.cuda.synchronize()
end = time.time()

I find that the times measured by method 1 and method 2 are very different, but the output result is the same. What's the difference between them?
Thanks for your attention.

Hi,

2 is the correct way to do this.

I wonder what the difference between them is. With method 1 I also get the correct result.

The main problem is that all CUDA operations are asynchronous.
This means that if you don't explicitly synchronize, the time you measure could actually correspond to the time it took to perform other operations that were queued before you started your timing.
What 2 does that 1 does not is make sure that nothing is still running on the GPU before you start your timer, and then make sure that all the operations you asked for have finished before you stop it.
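To make the difference concrete, here is a minimal sketch (the small Conv2d model and the input sizes are made up for illustration, not from the original post) that times the same forward pass both ways:

import time
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()  # stand-in for vgg_gpu_model
image = torch.rand(1, 3, 224, 224, device="cuda")

# Method 1: no synchronization -- mostly measures how long it takes to queue the work.
start = time.time()
result = model(image)
end = time.time()
print("without synchronize: %.3f ms" % ((end - start) * 1000))

# Method 2: synchronize before and after -- measures the actual GPU execution.
torch.cuda.synchronize()        # wait for anything already queued on the GPU
start = time.time()
result = model(image)
torch.cuda.synchronize()        # wait for this forward pass itself to finish
end = time.time()
print("with synchronize: %.3f ms" % ((end - start) * 1000))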

Thank you for your reply.
Actually, when I use method 1 it takes 5 ms, while with method 2 it takes 300 ms. I wonder what the extra 295 ms is spent on? I feel that within 5 ms the operation should already have finished.

No, a VGG takes on the order of hundreds of ms to run a forward pass.
5 ms is how long it takes to run the Python code, queue up all the CUDA operations and start the execution.
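One way to see that split (just a sketch with a stand-in model, not the code from this thread) is to time the launch with time.time() and the GPU work itself with CUDA events:

import time
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()  # stand-in model
image = torch.rand(1, 3, 224, 224, device="cuda")

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
t0 = time.time()
start_event.record()            # marks the start on the GPU stream
result = model(image)
end_event.record()              # marks the end on the GPU stream
t1 = time.time()                # returns as soon as the work has been queued

torch.cuda.synchronize()        # wait until both events have actually been reached
print("queueing/launch time: %.3f ms" % ((t1 - t0) * 1000))
print("GPU execution time  : %.3f ms" % start_event.elapsed_time(end_event))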

However, if I do not use torch.cuda.synchronize(), my code ends and returns the result after 5 ms; this should prove that I have already got what I want.

Here is my code:

import time
import torch

def do_it(X, kh, kw):
  if not X.is_cuda:
    raise Exception("X is not on the GPU!")
  if not len(X.shape) == 3:
    raise Exception("The shape of X is wrong!")
  C, H, W = X.shape
  # Extract all kh x kw patches with stride 1 along H and W.
  X_unfold = X.unfold(1, kh, 1).unfold(2, kw, 1)
  X_unfold = X_unfold.permute(1, 2, 0, 3, 4)
  shape_ = X_unfold.shape
  sample_num = shape_[0] * shape_[1]
  sample_dim = shape_[2] * shape_[3] * shape_[4]
  # Flatten each patch into a row, then compute the Gram matrix.
  Y = X_unfold.reshape(sample_num, sample_dim)
  K = torch.mm(Y, Y.t())
  return Y, K

if __name__ == '__main__':
  X = torch.rand(512,60,60).cuda()
  Y, K = do_it(X,15,15)
  torch.cuda.synchronize()
  start = time.time()
  for i in range(10):
    _, K = do_it(X,15,15)
  torch.cuda.synchronize()
  end = time.time()
  print(end-start)

It returns, but those tensors are not readily available. This is all done transparently by PyTorch, but if, for example, you try to print one of the values of the output, it will take the 300 ms, because to get the value you need to wait for it to be computed.
What PyTorch does is pass tensors around without knowing their content, and you only wait for the content when you don't have any other choice, which is mostly when the value is required on the CPU side.
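You can check this with a small sketch (the matrix sizes here are made up, but the pattern is the same as in your do_it loop): the matmul calls return immediately, and the wait only happens once a value is needed on the CPU.

import time
import torch

X = torch.rand(512, 60, 60, device="cuda")
W = torch.rand(512, 512, device="cuda")

torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    K = torch.mm(W, X.reshape(512, -1))   # queued asynchronously, returns right away
mid = time.time()
value = K[0, 0].item()                    # needs the value on the CPU, so it blocks here
end = time.time()

print("after queueing the matmuls: %.3f ms" % ((mid - start) * 1000))
print("after reading one value   : %.3f ms" % ((end - start) * 1000))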

OK, thank you, I see it now.