Speed differs with the same functionality

I tested creating a cost volume (stereo depth estimation)
and evaluated the two methods below.
What makes the timing differ?

    start_full_time = time.time()
    B, C, H, W = refimg_fea.shape
    cost = refimg_fea.new_zeros([B, 2*C, self.maxdisp//4, H, W], requires_grad=False)
    print('time = %.4f [s]' % (time.time() - start_full_time))
    print(cost.device)

time = 0.0000 [s]
cuda:0

    start_full_time = time.time()
    B, C, H, W = refimg_fea.shape
    cost = torch.FloatTensor(B, C*2, self.maxdisp//4, H, W).cuda()
    print('time = %.4f [s]' % (time.time() - start_full_time))
    print(cost.device)

time = 0.1875 [s]
cuda:0

My guess is that the first snippet creates the tensor directly on CUDA,
while the second creates the tensor on the CPU and then moves it to CUDA,
so the move to CUDA consumes the extra time.

If yes, is there any way to create the tensor directly on CUDA?

The reason could be the different code paths taken by the tensor constructors.
However, since CUDA operations are asynchronous, you would need to synchronize the code before starting and stopping the timer.
The torch.utils.benchmark utilities can be helpful, as they synchronize and add warmup iterations.
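For reference, a minimal sketch of the suggested `torch.utils.benchmark` approach, using small hypothetical shapes in place of `refimg_fea` and `self.maxdisp // 4`, and falling back to the CPU when no GPU is available:

```python
import torch
import torch.utils.benchmark as benchmark

# Hypothetical shapes standing in for refimg_fea and self.maxdisp // 4.
device = "cuda" if torch.cuda.is_available() else "cpu"
B, C, D, H, W = 1, 8, 12, 32, 64
refimg_fea = torch.randn(B, C, H, W, device=device)

common = {"refimg_fea": refimg_fea, "torch": torch,
          "B": B, "C": C, "D": D, "H": H, "W": W, "device": device}

# new_zeros allocates directly on refimg_fea's device.
t0 = benchmark.Timer(stmt="refimg_fea.new_zeros([B, 2 * C, D, H, W])",
                     globals=common)
# The legacy constructor allocates on the CPU first, then copies.
t1 = benchmark.Timer(stmt="torch.FloatTensor(B, 2 * C, D, H, W).to(device)",
                     globals=common)

# Timer handles CUDA synchronization and warmup iterations internally.
print(t0.timeit(10))
print(t1.timeit(10))
```

To answer the direct-allocation question: `torch.empty(B, 2 * C, D, H, W, device='cuda')` (or `torch.zeros(...)`) also creates the tensor on the GPU without an intermediate CPU tensor.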

Thank you for your reply.

Let me confirm:
what you are saying is that the timer runs on the CPU while the tensor operation runs on the GPU, so they don't synchronize, and that's why I need to synchronize?

thank you in advance

Yes, you would be profiling the CPU calls and the kernel launch, not the actual data transfer or any workload on the GPU, as it's executed asynchronously and the CPU can run ahead and stop the timer while the GPU is still busy.
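To illustrate, a sketch of the manual timing approach with explicit synchronization, using hypothetical small shapes (the synchronize calls are skipped when no GPU is available):

```python
import time
import torch

# Hypothetical shapes standing in for refimg_fea and self.maxdisp // 4.
device = "cuda" if torch.cuda.is_available() else "cpu"
B, C, D, H, W = 1, 8, 12, 32, 64
refimg_fea = torch.randn(B, C, H, W, device=device)

# Warmup so one-time context/allocator costs don't skew the measurement.
for _ in range(3):
    refimg_fea.new_zeros([B, 2 * C, D, H, W])

if device == "cuda":
    torch.cuda.synchronize()  # wait for pending GPU work before starting
start = time.time()
cost = refimg_fea.new_zeros([B, 2 * C, D, H, W])
if device == "cuda":
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping
print('time = %.4f [s]' % (time.time() - start))
```

Without the second `synchronize()`, the timer can stop while the GPU is still working, which is why the first snippet above reported `0.0000 [s]`.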
