I am noticing that the runtime of a for loop is very high compared to broadcasting in PyTorch.
Below I am posting a code snippet to explain my point:
I am computing the MSE of 100,000 random data points using a for loop vs. broadcasting, on CUDA.
There is a big difference in running time: ~25 s vs. ~0.0006 s.
What explains this big difference? How can I bring down the ~25 s runtime?
Replacing torch with numpy in the snippets below gives ~0.25 s vs. ~0.0075 s.
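For reference, the numpy versions I timed look roughly like this (a sketch mirroring the torch snippets below; function names are just placeholders):

```python
import numpy as np

# Loop version: accumulate squared errors one random pair at a time
def numpy_loop_test():
    error = 0.0
    for _ in range(100000):
        value, output = np.random.rand(1), np.random.rand(1)
        error += (value - output) ** 2
    return error

# Vectorized version: generate all the pairs at once and reduce
def numpy_vector_test():
    values = np.random.rand(100000)
    output = np.random.rand(100000)
    return np.sum((values - output) ** 2)
```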
import torch
import timeit
#Processing mse of 100000 data points thru a loop using pytorch cuda
def torch_loop_cuda_test():
    error = torch.tensor([0.], device='cuda')
    for _ in range(100000):
        value, output = torch.rand(1, device='cuda'), torch.rand(1, device='cuda')
        error_val = value - output
        error += torch.pow(error_val, 2)

print(timeit.timeit(stmt='torch_loop_cuda_test()',
                    globals=globals(),
                    number=1))
output: 24.775256000000013
#Processing mse of 100000 data points thru pytorch cuda broadcasting
def torch_vector_cuda_test():
    values = torch.rand(100000, device='cuda')
    output = torch.rand(100000, device='cuda')
    error = values - output
    error = torch.pow(error, 2)
    torch.sum(error)

print(timeit.timeit(stmt='torch_vector_cuda_test()',
                    globals=globals(),
                    number=1))
output: 0.0006479000001036184
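One thing I am unsure about in my measurement: CUDA kernel launches are asynchronous, so the ~0.0006 s may only measure the launch, not the GPU work itself. A sketch of how I could time it with an explicit torch.cuda.synchronize() (falling back to CPU when CUDA is unavailable; names are my own):

```python
import timeit
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def vector_test():
    values = torch.rand(100000, device=device)
    output = torch.rand(100000, device=device)
    return torch.sum((values - output) ** 2)

def timed_vector_test():
    result = vector_test()
    if device == 'cuda':
        torch.cuda.synchronize()  # block until the GPU has actually finished
    return result

print(timeit.timeit(stmt='timed_vector_test()',
                    globals=globals(),
                    number=1))
```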