Big variation in runtime when using for loop vs broadcasting on GPU

I am noticing that the runtime of a for loop is very high compared to broadcasting in PyTorch.

Below, I am posting code snippets to explain my point:
I compute the MSE of 100,000 random data points using a for loop vs. using broadcasting on CUDA.
There is a big difference in running time: ~25 s vs. ~0.0006 s.

What explains this big difference? How can I bring down the ~25 s runtime?

Replacing torch with numpy in the snippets below gives ~0.25 s vs. ~0.0075 s.
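For reference, the NumPy versions mentioned above would look roughly like this (a sketch that mirrors the PyTorch snippets below; the exact variable names are assumptions):

```python
import numpy as np

def numpy_loop_test():
    # scalar-at-a-time accumulation, one Python-level operation per data point
    error = 0.0
    for _ in range(100000):
        value, output = np.random.rand(1), np.random.rand(1)
        error += (value - output) ** 2
    return error

def numpy_vector_test():
    # one vectorized expression over all 100,000 points
    values = np.random.rand(100000)
    output = np.random.rand(100000)
    return np.sum((values - output) ** 2)
```

Even on the CPU the loop pays per-iteration Python and NumPy dispatch overhead, which is why it is ~30x slower than the vectorized version.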

import torch
import timeit

# Compute the MSE of 100,000 data points through a loop using PyTorch on CUDA
def torch_loop_cuda_test():
    error = torch.tensor([0.], device='cuda')
    for _ in range(100000):
        value, output = torch.rand(1, device='cuda'), torch.rand(1, device='cuda')
        error_val = value - output
        error += torch.pow(error_val, 2)

print(timeit.timeit(stmt = 'torch_loop_cuda_test()', 
                    globals = globals(), 
                    number = 1))

output: 24.775256000000013

# Compute the MSE of 100,000 data points through PyTorch CUDA broadcasting
def torch_vector_cuda_test():
    values = torch.rand(100000, device='cuda')
    output = torch.rand(100000, device='cuda')
    error = values - output
    error = torch.pow(error, 2)

print(timeit.timeit(stmt = 'torch_vector_cuda_test()', 
                    globals = globals(), 
                    number = 1))

output: 0.0006479000001036184

The difference between the code snippets is sequential vs. parallel execution.
Each CUDA operation launches a kernel, which creates some overhead.
This kernel-launch overhead is often not visible if the actual workload on the GPU is large.

However, in your first example the workload is tiny (two scalar tensors), while you are triggering 100k CUDA operations, which creates a large overhead.

That being said, since CUDA kernels are launched asynchronously, you should synchronize the code via torch.cuda.synchronize() before starting and stopping the timer. Especially in your second example you might only be profiling the kernel launch, not the complete operation.
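The synchronized timing described above can be sketched like this (a rewrite of the second snippet; `timed_run` is a hypothetical helper, and it falls back to the CPU when no GPU is available so the pattern runs anywhere):

```python
import timeit
import torch

def torch_vector_test(device):
    values = torch.rand(100000, device=device)
    output = torch.rand(100000, device=device)
    return torch.pow(values - output, 2)

def timed_run(device):
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for any pending GPU work before starting the timer
    start = timeit.default_timer()
    torch_vector_test(device)
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for the launched kernels to finish before stopping
    return timeit.default_timer() - start

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(timed_run(device))
```

Without the second synchronize, the timer can stop while the kernels are still running on the GPU, so the measured time reflects only the launch cost.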


Thank you patrick for the reply.
Is there any other way to reduce the runtime of loops that run on the GPU?

The usual approach would be to use a batched/vectorized operation, i.e. to try to get rid of the Python loop.
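For example, the loop-based MSE above can be collapsed into a single batched expression (a sketch on the CPU with float64 for a clean numerical comparison; the function names are illustrative):

```python
import torch

def mse_loop(values, output):
    # scalar-at-a-time version: one operation (and, on CUDA, one kernel launch) per element
    error = torch.zeros(1, dtype=torch.float64)
    for v, o in zip(values, output):
        error += torch.pow(v - o, 2)
    return error / len(values)

def mse_vectorized(values, output):
    # one batched operation over the whole tensor
    return torch.mean(torch.pow(values - output, 2))

values = torch.rand(100, dtype=torch.float64)
output = torch.rand(100, dtype=torch.float64)
assert torch.allclose(mse_loop(values, output), mse_vectorized(values, output))
```

Both versions compute the same result, but the vectorized one issues a fixed, small number of operations regardless of the number of data points, so the per-operation launch overhead no longer scales with the data size.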