# Strange speed of tensor.sum()

Hello, all.
I have run into a problem: the speed of `tensor.sum()` is inconsistent when computing on the same kind of Tensor.
I want to calculate the confusion matrix of a foreground/background segmentation model, and below is my testing code:

```python
import time
import torch

def func(pr, gt):
    dump = 0.0
    for gt_i in range(2):
        for pr_i in range(2):
            num = (gt == gt_i) * (pr == pr_i)
            start = time.time()
            dump += num.sum()
            print("Finding Time: {} {} {t:.4f}(s)".format(gt_i, pr_i, t=(time.time() - start)))

if __name__ == '__main__':
    gt = torch.rand(1, 400, 400) > 0.5
    gt = gt.int().cuda()

    print(">>>>>>>>>>>>>>> Test One >>>>>>>>>>>>>>>")
    prob1 = torch.rand(1, 2, 400, 400)
    _, pr1 = prob1.topk(1, 1, True, True)
    pr1 = torch.squeeze(pr1, 1)
    pr1 = pr1.int().cuda()
    print(type(pr1), pr1.size())
    func(gt, pr1)

    print(">>>>>>>>>>>>>>> Test Two >>>>>>>>>>>>>>>")
    prob2 = torch.rand(1, 2, 400, 400)
    prob2 = prob2.cuda()
    _, pr2 = prob2.topk(1, 1, True, True)
    pr2 = torch.squeeze(pr2, 1)
    pr2 = pr2.int()
    print(type(pr2), pr2.size())
    func(gt, pr2)
```

The result is that the timing of `tensor.sum()` on `(gt == 0) * (pr == 0)` in Test Two is strangely slow, even though the input tensor types in Test One and Test Two are the same.

I cannot find the reason… Is there some hidden property of Tensor that I missed? Can anyone help?

Thanks

@apaszke I am seeking help again… I would appreciate any advice you can give. Thanks!

You need to insert proper synchronization, because the GPU runs asynchronously (unless something forces a sync, e.g. a CPU <-> GPU copy), so `time.time()` only measures the kernel launch, and any pending work gets charged to the next synchronizing call. There are a few posts on this forum that show how to do it (search for `torch.cuda.synchronize()`).

```python
import time
import torch

def func(pr, gt):
    dump = 0.0
    for gt_i in range(2):
        for pr_i in range(2):
            num = (gt == gt_i) * (pr == pr_i)
            # Make sure you don't have anything still running
            torch.cuda.synchronize()
            start = time.time()
            dump += num.sum()
            # Make sure everything has been done
            torch.cuda.synchronize()
            print("Finding Time: {} {} {t:.4f}(s)".format(gt_i, pr_i, t=(time.time() - start)))

if __name__ == '__main__':
    gt = torch.rand(1, 400, 400) > 0.5
    gt = gt.int().cuda()

    print(">>>>>>>>>>>>>>> Test One >>>>>>>>>>>>>>>")
    prob1 = torch.rand(1, 2, 400, 400)
    _, pr1 = prob1.topk(1, 1, True, True)
    pr1 = torch.squeeze(pr1, 1)
    pr1 = pr1.int().cuda()
    print(type(pr1), pr1.size())
    func(gt, pr1)

    print(">>>>>>>>>>>>>>> Test Two >>>>>>>>>>>>>>>")
    prob2 = torch.rand(1, 2, 400, 400)
    prob2 = prob2.cuda()
    _, pr2 = prob2.topk(1, 1, True, True)
    pr2 = torch.squeeze(pr2, 1)
    pr2 = pr2.int()
    print(type(pr2), pr2.size())
    func(gt, pr2)
```
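An alternative to bracketing with `torch.cuda.synchronize()` and `time.time()` is to use CUDA events, which record timestamps on the GPU itself. A minimal sketch (the `cuda_timed` helper name is illustrative, not from this thread), which falls back to a message when no CUDA device is present:

```python
import torch

def cuda_timed(fn):
    # CUDA events record timestamps in the GPU stream, so the measured
    # interval covers the actual kernel execution, not just its launch.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn()
    end.record()
    torch.cuda.synchronize()  # wait so elapsed_time() is valid
    return out, start.elapsed_time(end)  # milliseconds

if torch.cuda.is_available():
    x = torch.rand(400, 400, device="cuda")
    s, ms = cuda_timed(lambda: x.sum())
    print("sum computed in {:.4f} ms".format(ms))
else:
    print("no CUDA device available")
```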

@apaszke @albanD That’s really the reason. Thank you very much!
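As an aside, the four-iteration loop in `func` can be collapsed into a single pass: encode each (ground truth, prediction) pair as `gt * num_classes + pr` and count the combinations with `torch.bincount`. A minimal CPU sketch (the `confusion_matrix` helper name is mine, not from the thread):

```python
import torch

def confusion_matrix(pr, gt, num_classes=2):
    # Map each (gt, pr) pair to a unique index in [0, num_classes**2),
    # then count all pairs in one pass and reshape into a matrix
    # with rows = ground truth and columns = prediction.
    idx = gt.flatten().long() * num_classes + pr.flatten().long()
    counts = torch.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

gt = torch.tensor([0, 0, 1, 1])
pr = torch.tensor([0, 1, 1, 1])
print(confusion_matrix(pr, gt))  # tensor([[1, 1], [0, 2]])
```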