Execution time slows down when using an if statement

I noticed a weird slowdown after adding an if statement to my code. I load an image onto the CUDA device, then my neural network (with fixed parameters) detects whether there is an object in the image. If there is an object, the pixel values in the corresponding region are non-zero, otherwise they are 0. I must send a signal if there is a non-zero value in the predicted output, which I do with an if statement. Execution time slows down drastically when this if statement is present. Below is my code:

import time
import torch

# dataset (a NumPy array) and network (a model already on the GPU) are defined elsewhere
torch.cuda.synchronize()
i = 0
while i < 10:
    with torch.no_grad():
        a = time.perf_counter()
        image_i = torch.from_numpy(dataset).float().cuda() / 255.0
        pred = torch.argmax(network(image_i)["seg"][0], dim=0)
        if pred.byte().any() > 0:  # send a signal if any pixel is non-zero
            b = time.perf_counter()  # note: b is only updated when an object is detected
        print(b - a)
        torch.cuda.synchronize()

        i = i + 1

2.8723719230001734 seconds
2.821866113000169 seconds
2.8291808970000147 seconds
2.806728226000132 seconds
2.804821959000037 seconds
2.8151050120000036 seconds
2.808966260000034 seconds
2.847957038000004 seconds
2.812290454000049 seconds
2.826942403999965 seconds

If I remove the if statement, the code looks like this:

torch.cuda.synchronize()
i = 0
while i < 10:
    with torch.no_grad():
        a = time.perf_counter()
        image_i = torch.from_numpy(dataset).float().cuda() / 255.0
        pred = torch.argmax(network(image_i)["seg"][0], dim=0)
        pred.byte().any() > 0  # same check, but its result is never read on the CPU
        b = time.perf_counter()
        print(b - a)
        torch.cuda.synchronize()

        i = i + 1

0.011929868999914106 seconds
0.00671789400007583 seconds
0.009328374000006079 seconds
0.006993827000087549 seconds
0.008924279999973805 seconds
0.008238326999844503 seconds
0.010348931999942579 seconds
0.00666478800008008 seconds
0.008329585999945266 seconds
0.0066920950000621815 seconds

As you can see, the problem really is the if statement.

Could you give an explanation for this, if there is one?

It looks like a synchronization issue.
In your second example, pred.byte().any() > 0 is computed on the GPU, so it can simply be enqueued and your second timer b is called immediately, without waiting for the result.

The if condition, however, has to be evaluated on the CPU by the Python runtime, so it creates a synchronization point.
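
A minimal sketch of this effect (not from the thread; a large matmul stands in for the forward pass and a dummy check for the prediction):

import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

t0 = time.perf_counter()
y = x @ x                    # heavy GPU work, only enqueued
flag = (y.abs() > 0).any()   # also just enqueued; flag is a 0-dim CUDA tensor
t1 = time.perf_counter()     # returns almost immediately, nothing has been waited for

if flag:                     # converting the tensor to a Python bool forces a GPU -> CPU sync
    pass
t2 = time.perf_counter()     # t2 - t1 includes the wait for all queued kernels

print(t1 - t0, t2 - t1)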


Thank you very much for your fast reply. Yet, I'm still confused: am I measuring the time incorrectly, or does sending the signal conditioned on the if really take almost 3 seconds while my network predicts in less than 2 milliseconds? If so, is there a way to execute this if statement on the GPU?

You are most likely just measuring the kernel launch times in your second code snippet.
To properly time a segment, you would have to synchronize before starting and stopping the timer.
E.g., this code should show that the high duration comes from the actual forward pass:

i = 0
while i < 10:
    with torch.no_grad():
        image_i = torch.from_numpy(dataset).float().cuda() / 255.0
        torch.cuda.synchronize()   # make sure all previous work has finished before starting the timer
        a = time.perf_counter()
        pred = torch.argmax(network(image_i)["seg"][0], dim=0)
        torch.cuda.synchronize()   # wait for the forward pass to finish before stopping the timer
        b = time.perf_counter()
        pred.byte().any() > 0
        print(b - a)

        i = i + 1
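
An alternative sketch (not from the thread) using CUDA events, which time the GPU work on the stream itself and only need a single synchronize before reading the result; dataset and network are the same objects as above:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    image_i = torch.from_numpy(dataset).float().cuda() / 255.0
    start.record()                        # recorded on the current CUDA stream
    pred = torch.argmax(network(image_i)["seg"][0], dim=0)
    end.record()
    torch.cuda.synchronize()              # make sure both events have completed
    print(start.elapsed_time(end), "ms")  # elapsed_time reports milliseconds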

This was exactly the case. Thank you very much for your guidance!

For segmentation, how do I measure the prediction time, i.e. where should I put torch.cuda.synchronize() relative to time.perf_counter()?
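
Applying the same pattern as in the reply above (a sketch; model and batch are placeholder names for your segmentation network and a preprocessed input already on the GPU):

import time
import torch

with torch.no_grad():
    torch.cuda.synchronize()      # make sure no earlier GPU work is still queued
    start = time.perf_counter()
    output = model(batch)         # the prediction being timed
    torch.cuda.synchronize()      # wait for the forward pass to actually finish
    end = time.perf_counter()

print(f"prediction time: {end - start:.4f} s")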