Why is looping through PyTorch tensors so slow (compared to NumPy)?

I’ve been working with image transformations recently and ran into a situation where I have a large array (shape 100,000 x 3) in which each row represents a point in 3D space:

pnt = [x y z]

All I’m trying to do is iterate through each point and matrix-multiply it with a transformation matrix T (shape 3 x 3).
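For reference, here is some dummy data with the shapes described above (the values are random; any values would do for the timing):

import numpy as np

# dummy data matching the shapes above; values are arbitrary
rng = np.random.default_rng(0)
pnt_cloud = rng.standard_normal((100_000, 3))  # one [x, y, z] point per row
T = rng.standard_normal((3, 3))                # 3 x 3 transformation matrix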

Test with NumPy:

import numpy as np

def transform(pnt_cloud, T):
    depth_array = np.zeros(pnt_cloud.shape[0])

    # transform one point at a time, keeping the first coordinate
    # whenever it is positive
    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = np.dot(T, pnt)
        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

    return depth_array

Calling this function and timing it with %time gives:

Out[190]: CPU times: user 670 ms, sys: 7.91 ms, total: 678 ms
Wall time: 674 ms

Test with a PyTorch tensor:

import numpy as np
import torch

tensor_cld = torch.tensor(pnt_cloud)
tensor_T   = torch.tensor(T)

def transform(pnt_cloud, T):
    # np.zeros defaults to float64, so this matches the NumPy version
    depth_array = torch.tensor(np.zeros(pnt_cloud.shape[0]))

    # same loop as above, but every matmul and element assignment
    # now goes through PyTorch's dispatcher
    for i, pnt in enumerate(pnt_cloud):
        xyz_pnt = torch.matmul(T, pnt)
        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]

    return depth_array

Calling this function and timing it with %time gives:

Out[199]: CPU times: user 6.15 s, sys: 28.1 ms, total: 6.18 s
Wall time: 6.09 s

I would have thought PyTorch tensor computations would be much faster due to the way PyTorch breaks code down at the compile stage. What am I missing here?

Other things I’ve tried:

  1. Doing the same with torch.jit only shaves off about 2 s
  2. Tried torch.no_grad(), as I thought I was accumulating gradients (I realized that’s not how it works)
  3. Numba + NumPy jit works fastest (120 ms); a sketch is below
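
A rough sketch of the Numba variant from point 3 (the same loop as above, compiled with numba.njit; the exact version I ran may have differed in details):

import numba
import numpy as np

@numba.njit
def transform_numba(pnt_cloud, T):
    # compiled to machine code, so the per-element work is plain
    # memory reads/writes instead of Python/PyTorch dispatch
    depth_array = np.zeros(pnt_cloud.shape[0])
    for i in range(pnt_cloud.shape[0]):
        xyz_pnt = np.dot(T, pnt_cloud[i])
        if xyz_pnt[0] > 0:
            depth_array[i] = xyz_pnt[0]
    return depth_array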

Yeah, so two things:

  • Operations on 1-3 elements are generally rather expensive in PyTorch, as the overhead of Tensor creation becomes significant (this includes setting single elements); I think this is the main thing here. It is also why the JIT doesn’t help a whole lot (it only takes away the Python overhead) and why Numba shines (there, e.g., the assignment to depth_array[i] is just a memory write). The micro-benchmark below illustrates this.
  • The matmul itself might differ in speed if you have different BLAS backends for it in PyTorch vs. NumPy.
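
To see that per-op overhead directly, here is a rough micro-benchmark sketch (illustrative only, the numbers will depend on your machine; the bench helper is mine):

import time

import numpy as np
import torch

a_np = np.random.randn(3)
a_t = torch.randn(3)

def bench(fn, n=100_000):
    # time n calls of a tiny operation; with almost no arithmetic
    # per call, the total is dominated by per-call dispatch overhead
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

print("numpy:", bench(lambda: a_np * 2.0))
print("torch:", bench(lambda: a_t * 2.0))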

In this specific case, you could likely just do depth_array = torch.matmul(pnt_cloud, T.t()).clamp_(min=0) or so, and similarly in NumPy.
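
Spelled out, that could look like the following (the [:, 0] to pick out the depth column is implied by the loop above):

import numpy as np
import torch

def transform_torch(pnt_cloud, T):
    # one big (N, 3) x (3, 3) matmul instead of N tiny ones;
    # column 0 holds the transformed x of every point
    return torch.matmul(pnt_cloud, T.t())[:, 0].clamp(min=0)

def transform_numpy(pnt_cloud, T):
    return np.clip(pnt_cloud @ T.T, 0, None)[:, 0]

Comparing the result against the loop version with torch.allclose / np.allclose is a quick sanity check.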

Best regards

Thomas

(PS: it would help if you added dummy data to your code examples so people could just copy-paste them to look into the benchmarking.)
