Different time consumption of the same operation

Hi, I have encountered a problem that relates to the time consumption of the indexing operation.
I have a tensor which is ‘output:[B,C,H,W]’ and I use unfold to transform it like ‘out_patch:[B,C,H//N,W//N,N,N]’. N is the patch size.
And there is a binary mask tensor ‘mask:[B,C,H//N,W//N]’ so that I can select patches from out_patch by: ‘patch=out_patch[mask]’.
There is a similar operation like ‘patch=out_patch[mask]’ in the latter part of the code but the time it consumed is nearly 100 times of the former one. I have tested many times but the problem still occurs.

Later, I found the problem was that there are some nn.conv2d operations running before the indexing operation and the whole code is running on GPU. If I put the indexing operation before the convolutions or run the code on CPU, the time would be the same as the former one.

I wonder what causes this phenomenon and is there any way to avoid it? As in my method, the latter indexing operation has to be put behind the convolutions and the code has to be run on GPU. Running Time is very important to me. Thanks.

Hello, is there anyone who could help with it? : (

CUDA operations are executed asynchronously and based on the description of the issue I guess you might not synchronize the code before starting and stopping the timers, which would return a wrong timing profile (e.g. only the kernel launches might be profiled).
Synchronize the code via torch.cuda.synchronize() before starting/stopping the timers, add warmup iterations, and in the best case calculate the average time of multiple iterations to stabilize the profiling.

Hi, thanks so much for your response. I have revised my code based on your suggestion and I found it seems not to be the problem of the location of the indexing operation, but the operation of torch.nonzero takes up too much time. As convolution is highly sped up by GPU, operations like torch.nonzero seem to be less efficient when the code is running on GPU and this single operation takes almost half of the time of the whole inference process. I wonder whether this observation is normal and is there any way to reduce the time of these operations. Thanks.

The nonzero() operation is synchronizing the code, so unless you’ve added the manual synchronizations into your code as suggested, your timings will be wrong and the nonzero() operation will accumulate the execution time of the asynchronously launched CUDA kernels.

Actually I have added the synchronizations, but I am not sure whether I have done it right:

t_nonzero = []
class net(nn.Module):
   def forward():
   def inferece():
        t1 = 1000*time.time()
        idx = torch.nonzero(mask.squeeze(1))
        t2 = 1000*time.time()

if __name__ == "__main__":
     ## warmup
     for _ in range(100):
         out = net.inference(input, ratio, patchsize)
     t0 = 1000 * time.time()
     n = 1000
     for _ in range(n):
         out = net.inference(input, ratio, patchsize)
     t1 = 1000 * time.time()
     fps = 1000 * n / (t1 - t0)


Could you please help me with it? Thanks a lot.

You are synchronizing before some timers, but skip it inside the inference operation.
What could happen is:

torch.cuda.synchronize() # valid sync, CPU waits until GPU is ready
t0 = 1000 * time.time() # start timer
out = net.inference(input, ratio, patchsize) # execute inference
    # inside inference
    ...... # execute model forward pass
    t1 = 1000*time.time() # start new timer without waiting for the GPU to finish the work
    idx = torch.nonzero(mask.squeeze(1)) # nonzero synchronizes, CPU waits
    t2 = 1000*time.time() # stop timer, which shows the GPU forward time + nonzero op

Hi, if I am understanding correctly, what you mean is that torch.nonzero operation indeed takes up some time but my measurement of that operation is wrong and it didn’t take that lot of time. I deleted the torch.nonzero operation and run the inference process again, finding that there are just a few decreases in the whole inference time which means that torch.nonzero is not that time-consuming as I expected. But I believe these indexing operations should be less effective than convolutions when the code is running on GPU. So I guess real-time models should avoid these operations as possible.

Yes, the nonzero would need some time to be executed, but you are right that your profiling is wrong as it accumulates the model forward time into the nonzero operation, so it looks as if the nonzero op is more expensive than it is.
As mentioned before, this operation is synchronizing, so should be avoided for this reason (in case you want to remove all syncs, if possible).

I would claim it depends on the actual workflow, so you could profile it with synchronizations and compare the runtimes, if needed.

Thanks so much for your reply! It helps me a lot.