# Why is looping through PyTorch tensors so slow (compared to NumPy)?

I’ve been working with image transformations recently and ran into a situation where I have a large array (shape 100,000 x 3) in which each row represents a point in 3D space, like:

``````
pnt = [x, y, z]
``````

All I’m trying to do is iterate over the points and matrix-multiply each one with a matrix called T (shape 3 x 3).

## Test with NumPy:

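``````
import numpy as np

def transform(pnt_cloud, T):
    # One depth value per point; it stays 0 where the transformed x is not positive.
    depth_array = np.zeros(pnt_cloud.shape[0])

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = np.dot(T, pnt)  # (3, 3) @ (3,) -> (3,)

        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

        i += 1

    return depth_array
``````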

Calling the function above and timing it with `%time` gives:

``````Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms
``````

## Test with PyTorch Tensor:

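``````
import numpy as np
import torch

# pnt_cloud and T are the NumPy arrays used in the NumPy test.
tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

def transform(pnt_cloud, T):
    # One depth value per point; it stays 0 where the transformed x is not positive.
    depth_array = torch.tensor(np.zeros(pnt_cloud.shape[0]))

    i = 0
    for pnt in pnt_cloud:
        xyz_pnt = torch.matmul(T, pnt)  # (3, 3) @ (3,) -> (3,)

        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

        i += 1

    return depth_array
``````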

Calling the function above on the tensors and timing it with `%time` gives:

``````Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s
``````

I would have thought that PyTorch tensor computations would be much faster due to the way PyTorch breaks its code down in the compiling stage. What am I missing here?

Other things I’ve tried:

1. Doing the same with `torch.jit` only saves about 2 s.
2. Using `torch.no_grad()`, as I thought I was accumulating gradients (I realized that’s not how it works).
3. Numba-jitted NumPy code is the fastest (120 ms).

Yeah, so two things:

• Operations on 1-3 elements are generally rather expensive in PyTorch, as the overhead of Tensor creation becomes significant (this includes setting single elements). I think this is the main thing here. It is also the reason why the JIT doesn’t help a whole lot (it only takes away the Python overhead) and why Numba shines (where e.g. the assignment to depth_array[i] is just a memory write).
• The matmul itself might differ in speed if PyTorch and NumPy use different BLAS backends.
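To see the per-operation overhead in isolation, here is a rough micro-benchmark sketch. The dummy data, the helper `per_call_us`, and the call count are assumptions made for illustration, not part of the original question, and the absolute numbers will depend on your machine and library builds.

```python
import timeit

import numpy as np
import torch

# Dummy data (assumed): one 3D point and one 3x3 matrix, float64 in both libraries.
T_np = np.random.rand(3, 3)
pnt_np = np.random.rand(3)
T_t = torch.from_numpy(T_np)
pnt_t = torch.from_numpy(pnt_np)

n_calls = 10_000  # arbitrary; large enough to average out timer noise


def per_call_us(fn):
    """Average wall time of one call to fn, in microseconds (hypothetical helper)."""
    return timeit.timeit(fn, number=n_calls) / n_calls * 1e6


# A tiny matmul: the arithmetic is trivial, so the time is mostly call overhead.
np_matmul_us = per_call_us(lambda: np.dot(T_np, pnt_np))
torch_matmul_us = per_call_us(lambda: torch.matmul(T_t, pnt_t))

# Writing a single element goes through each library's indexing machinery.
out_np = np.zeros(1)
out_t = torch.zeros(1, dtype=torch.float64)


def write_np():
    out_np[0] = 1.0


def write_torch():
    out_t[0] = 1.0


np_write_us = per_call_us(write_np)
torch_write_us = per_call_us(write_torch)

print(f"3x3 @ 3 matmul: numpy {np_matmul_us:.2f} us, torch {torch_matmul_us:.2f} us")
print(f"single write  : numpy {np_write_us:.2f} us, torch {torch_write_us:.2f} us")
```

Multiplying the per-call times by the 100,000 loop iterations gives a feel for how much of the total runtime is pure overhead rather than arithmetic.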

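In this specific case, you could likely just do `depth_array = torch.matmul(pnt_cloud, T.t())[:, 0].clamp_(min=0)` or so: one matmul over all points, keeping the first component and clamping negative values to zero. Similarly in NumPy.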
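As a sanity check of the vectorized idea, the sketch below compares it against the loop from the question on dummy data. The random data, the reduced point count, and the function names `transform_loop`, `transform_numpy`, and `transform_torch` are assumptions for illustration. Note the `[:, 0]`: the loop only stores the first component of each transformed point, so the batched product has to be sliced the same way.

```python
import numpy as np
import torch

# Dummy data (assumed): same layout as in the question, fewer points so the loop is quick.
rng = np.random.default_rng(0)
pnt_cloud = rng.standard_normal((1_000, 3))
T = rng.standard_normal((3, 3))


def transform_loop(pnt_cloud, T):
    """Reference: the per-point loop from the question."""
    depth_array = np.zeros(pnt_cloud.shape[0])
    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = np.dot(T, pnt)
        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]
    return depth_array


def transform_numpy(pnt_cloud, T):
    """One matrix product for all points, then keep the positive x components."""
    return np.clip((pnt_cloud @ T.T)[:, 0], 0, None)


def transform_torch(pnt_cloud, T):
    """Same idea in PyTorch; clamp(min=0) zeroes the negative entries."""
    return torch.matmul(pnt_cloud, T.t())[:, 0].clamp(min=0)


expected = transform_loop(pnt_cloud, T)
depth_np = transform_numpy(pnt_cloud, T)
depth_torch = transform_torch(torch.from_numpy(pnt_cloud), torch.from_numpy(T))

# Both comparisons should report that the vectorized results match the loop.
print(np.allclose(expected, depth_np))
print(np.allclose(expected, depth_torch.numpy()))
```

Row `i` of `pnt_cloud @ T.T` equals `T @ pnt_cloud[i]`, which is why a single product can replace the whole loop. Only the first row of `T` is actually needed for the depth, so `pnt_cloud @ T[0]` would be a further simplification.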

Best regards

Thomas

(PS: it would help if you added dummy data to your code examples so people could just copy-paste them to look into the benchmarking.)
