Vectorizing custom multiply and add functions mymult(num1,num2) and myadd(num1,num2)

A convolution can be converted to a matrix multiplication using [1] [2], after which torch.matmul() can be used. My question: I have to replace the addition and multiplication with my own functions, mymult(num1,num2) and myadd(num1,num2). Currently I replace torch.matmul() with loops that multiply and add element-wise, which is really slow. I would like to vectorize this or use broadcasting, but I am not sure how to do that. In short, I would like to make it faster.

Where are these loops added in your code?
Once you have created the patches, a vectorized version should be doable.
Are these loops in your custom multiplication and addition?

Thank you very much for your response. Let me share a code snippet of the forward function.

import torch
import torch.nn as nn
import torch.nn.functional as F


def mymult(num1, num2):
    # Placeholder: the real version applies some boolean operations
    # to num1 and num2 and returns num1 * num2.
    return num1 * num2


def convcheck():
    torch.manual_seed(1)
    batch_size = 2
    channels = 1

    h, w = 4, 4
    image = torch.randn(batch_size, channels, h, w)  # input image

    out_channels = 2
    kh, kw = 2, 2  # kernel size
    dh, dw = 1, 1  # stride
    size = (h - kh + 2 * 0) // dh + 1  # output height/width (padding = 0)

    conv = nn.Conv2d(in_channels=channels, out_channels=out_channels,
                     kernel_size=kw, padding=0, stride=dh)

    out = conv(image)
    filt = conv.weight.data
    bias = conv.bias.data
    # imageunfold: (batch_size, channels * kh * kw, size * size)
    imageunfold = F.unfold(image, kernel_size=kh, padding=0, stride=dh)

    # kernels_flat: (out_channels, channels * kh * kw)
    kernels_flat = filt.view(out_channels, -1)

    res = torch.matmul(kernels_flat, imageunfold)  # same as kernels_flat @ imageunfold

    result1 = torch.zeros(batch_size, kernels_flat.size(0), imageunfold.size(2))

    for m_batch in range(imageunfold.size(0)):
        # iterate through rows of X (kernels_flat)
        for i in range(kernels_flat.size(0)):
            # iterate through columns of Y (imageunfold[m_batch])
            for j in range(imageunfold.size(2)):
                # iterate through rows of Y
                for k in range(imageunfold.size(1)):
                    # kernels_flat[i][k] * imageunfold[m_batch][k][j]
                    result1[m_batch][i][j] += mymult(float(kernels_flat[i][k]),
                                                     float(imageunfold[m_batch][k][j]))

    # Can the loops above be vectorized or otherwise made fast?

    # out, res and result1 are all equivalent once the bias is added below

    res = res.view(-1, out_channels, size, size)
    result1 = result1.view(-1, out_channels, size, size)

    if bias is not None:
        bias = bias.view(1, -1, 1, 1)  # broadcast over batch, height and width
        result1 += bias
        res += bias
I would like the above loops to run in parallel, or at least faster; I am not sure which term to use here. I really appreciate the time and effort you put into the forums.

Thanks for the code!
What is mymult applying internally (if you don’t want to share it due to research etc., it’s OK)?
If you are using some PyTorch methods internally, they should be able to use tensors (or batches of tensors) instead of scalar values.
On the other hand, if you are using some other (unsupported) operations, your best option might be to write a custom C++/CUDA extension and vectorize the code manually.
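
To make the tensor-based route concrete, here is a minimal sketch. It assumes mymult can be expressed with elementwise PyTorch operations that broadcast like the built-in `*` (the body below is only a stand-in), and the helper name custom_matmul is made up for illustration. It also reduces with a plain sum, so a custom myadd would still need its own handling.

```python
import torch


def mymult(num1, num2):
    # Stand-in for the custom multiplier. Assumption: it works elementwise
    # on tensors and broadcasts like the built-in `*`.
    return num1 * num2


def custom_matmul(kernels_flat, imageunfold):
    # kernels_flat: (O, K), imageunfold: (B, K, L)
    a = kernels_flat[None, :, :, None]  # (1, O, K, 1)
    b = imageunfold[:, None, :, :]      # (B, 1, K, L)
    products = mymult(a, b)             # (B, O, K, L): every product at once
    return products.sum(dim=2)          # reduce over K -> (B, O, L)


torch.manual_seed(1)
kernels_flat = torch.randn(2, 4)
imageunfold = torch.randn(2, 4, 9)
result = custom_matmul(kernels_flat, imageunfold)
reference = torch.matmul(kernels_flat, imageunfold)
print(torch.allclose(result, reference, atol=1e-5))
```

The trade-off is memory: the intermediate products tensor has B * O * K * L elements, so for large layers it may be necessary to process the batch in chunks.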

Thanks, I rewrote the code using a tensor-based version and it is very fast compared to the loopy one.

Now mymult(num1,num2) takes two tensors, num1 and num2, converts them to NumPy arrays to perform the unsupported operations, and then returns the result. Thanks a ton.
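
A rough sketch of what such a wrapper could look like, since the actual implementation was not posted. The np.multiply call is only a stand-in for the unsupported operations. Note that gradients do not flow through the NumPy part, so this suits inference or a custom backward pass.

```python
import numpy as np
import torch


def mymult(num1, num2):
    # Tensor -> NumPy. detach() drops the autograd history and cpu()
    # is required if the tensors live on a GPU.
    a = num1.detach().cpu().numpy()
    b = num2.detach().cpu().numpy()
    # Stand-in for the unsupported operations (assumption: elementwise).
    out = np.multiply(a, b)
    # NumPy -> tensor, moved back to the device of the first input.
    return torch.from_numpy(out).to(num1.device)


x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
z = mymult(x, y)
print(z)
```

Because NumPy follows the same broadcasting rules as PyTorch, the wrapper can be called on whole reshaped tensors instead of on scalars.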

We have an approximate multiplier now, and I have the same problem as you. I want to customize torch.matmul to use our multiplication. Could you tell me how you solved this problem? Thanks a lot.

You have to write your own matmul function which uses your approximate multiplication instead of the original multiplication (i.e. *).

I used NumPy to get what I wanted. The rest of the code is already available in this post.

Hey folks, we need to customize the adder and multiplier as well. We are currently trying Sami’s code snippet (the loopy version) and are suffering from the extremely long run time. As Sami mentioned above, it would be faster to use NumPy instead of the loopy version. Do we still need a for loop to iterate over kernels_flat and imageunfold? Some advice would help us a lot. Thanks!

If you can express your multiplication and addition with NumPy/PyTorch operations, then you don't need these Python loops. The easiest way is to represent your adder and multiplier using NumPy/PyTorch built-in functions and then use the code snippet above. You might have to play with the shapes of the tensors a little bit, but it is definitely possible and fast.
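
As an illustration of building a multiplier from built-in functions, here is a purely hypothetical approximate multiplier for integers that drops the low bits of each operand before multiplying. The function name, the drop_bits parameter and the truncation scheme are all assumptions, not anyone's actual design in this thread.

```python
import numpy as np


def approx_mult(a, b, drop_bits=2):
    # Hypothetical approximate multiplier: discard the lowest `drop_bits`
    # bits of each operand, multiply, then shift the product back.
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.int64)
    return ((a >> drop_bits) * (b >> drop_bits)) << (2 * drop_bits)


a = np.array([8, 13, 100])
b = np.array([4, 7, 25])
approx = approx_mult(a, b)                     # elementwise, no Python loop
exact = a * b
table = approx_mult(a[:, None], b[None, :])    # all pairs via broadcasting
print(approx, exact, table.shape)
```

Since it only uses shifts and `*` on arrays, the same function works on scalars, vectors, or broadcast shapes without any change.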

Hi Sami, thanks for your reply. Yes, we think the fastest way is to use NumPy to implement our custom adder/multiplier. Your code snippet is very helpful, and we can now use matrix multiplication to do what we want. But that's not enough: because we want to build our own adder/multiplier, we cannot use the existing matmul method directly.
We then tried to use add and mul from NumPy to implement our custom adder/multiplier. Doing this, we found that we need at least one loop, as the data in one column/row needs to be copied and used in many multiplications. Using the loop makes it slow again. Do you have any insight about this?
Thanks again!
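
One possible way around the copying, sketched under the assumption that both the custom multiplier and the custom adder work elementwise on NumPy arrays (the bodies below are stand-ins): keep a single Python loop over the reduction dimension only, and let broadcasting reuse each row/column value without explicit copies.

```python
import numpy as np


def mymult(a, b):
    return np.multiply(a, b)  # stand-in for the custom multiplier


def myadd(a, b):
    return np.add(a, b)  # stand-in for the custom adder


def custom_matmul(kernels_flat, imageunfold):
    # kernels_flat: (O, K), imageunfold: (B, K, L)
    n_batch, n_k, n_cols = imageunfold.shape
    n_out = kernels_flat.shape[0]
    # Starts from zero like the loopy version (assumes myadd(0, x) is acceptable).
    acc = np.zeros((n_batch, n_out, n_cols), dtype=imageunfold.dtype)
    for k in range(n_k):
        # (1, O, 1) against (B, 1, L): broadcasting pairs every kernel
        # value with every patch value for this k, giving (B, O, L).
        prod = mymult(kernels_flat[None, :, k, None], imageunfold[:, None, k, :])
        acc = myadd(acc, prod)
    return acc


rng = np.random.default_rng(0)
kernels_flat = rng.standard_normal((2, 4))
imageunfold = rng.standard_normal((2, 4, 9))
result = custom_matmul(kernels_flat, imageunfold)
print(np.allclose(result, np.matmul(kernels_flat, imageunfold)))
```

The loop runs only K times (channels * kh * kw), while each iteration handles all batches, output channels and patches at once, so the Python overhead stays small compared to the fully loopy version.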

Python loops are slow compared to the underlying C loops. If you cannot get rid of a Python loop, maybe use Cython for that specific loop. I agree that it is sometimes hard to get what you want from the built-in functions. I hope this helps.