Hello! I want to compute the Gram matrix in PyTorch with A = B.mm(B.t()). The result is a symmetric matrix, which means I could halve the number of operations by computing only the upper triangular part. How can I do that without using loops? A for loop is extremely slow in torch.

Using triu_indices/tril_indices is extremely slow too (e.g. for shape (64, 64, 300, 300)). Please give an answer that uses neither indices nor a for loop, thanks!
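For reference, here is a minimal sketch of the index-free way to keep only the upper triangle of a batched Gram matrix. The small shape is a stand-in chosen for illustration, not the poster's actual data. Note that this still computes the full product first, so it saves no arithmetic; it only avoids building index tensors.

```python
import torch

torch.manual_seed(0)

# Small stand-in for a large batched input such as (64, 64, 300, 300).
B = torch.randn(4, 3, 5, 7)

# Batched Gram matrices: one (5, 5) symmetric matrix per leading batch entry.
A = B @ B.transpose(-1, -2)

# torch.triu works on the last two dimensions and zeroes everything below
# the diagonal, without any explicit index tensor or Python loop.
upper = torch.triu(A)
```

Whether this is faster than the triu_indices route depends on the hardware and shapes, so it is worth timing on the real input.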

Your only hope of beating mm would be to use a specialized low-level function: that's syrk from BLAS (herk is its complex Hermitian counterpart), in the cuBLAS or CPU version, I believe. You'd need to write a C++ extension to get a Python interface and gradients.

Even if you go that route, don't expect a 2x speedup: the arithmetic instructions themselves are cheap, while memory latency and other overheads won't change.
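To get a feel for what the BLAS symmetric rank-k routine (syrk, the real-valued sibling of herk) returns before investing in an extension, one option is to call it on the CPU through SciPy's BLAS wrappers. This is only a sketch under the assumption that SciPy is installed and the data is float64 on the CPU; it gives no gradients and is not a PyTorch op. A CPU tensor could be passed in via .numpy().

```python
import numpy as np
from scipy.linalg.blas import dsyrk

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))  # 6 vectors of dimension 4

# dsyrk computes alpha * B @ B.T but fills only one triangle of the result.
# With the default lower=0, only the upper triangle is meaningful.
upper = dsyrk(1.0, B)

# Full product for comparison.
full = B @ B.T
```

Timing dsyrk against the full product on realistic sizes is a cheap way to check how much of the theoretical 2x actually shows up.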

Thanks for your answer! Where can I find the source code for mm? Maybe I can modify it there directly, if possible. I just want to change

for i in range(100):
    for j in range(100):

to

for i in range(100):
    for j in range(i):
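Written out in full, the change being asked about looks roughly like the naive pure-Python sketch below. The function name gram_upper_then_mirror is made up for illustration, and this is not how mm is implemented. The inner loop runs over j >= i so the diagonal is included, and each dot product is mirrored into the lower triangle.

```python
def gram_upper_then_mirror(rows):
    """Naive Gram matrix: compute each pair (i, j) with j >= i once, then mirror."""
    n = len(rows)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle, diagonal included
            dot = sum(x * y for x, y in zip(rows[i], rows[j]))
            gram[i][j] = dot
            gram[j][i] = dot   # symmetry: reuse the value for the lower triangle
    return gram


rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gram = gram_upper_then_mirror(rows)
```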

No, you won’t be able to modify mm if you think it is about such loops. I’ll point you to sources, but this is just to give you an idea

addmm_out_cuda_impl

addmm_impl_cpu_

Note that there are something like 5-10 wrappers above these routines in ATen (and mm dispatches to addmm there), and they still dispatch to an external BLAS library, which processes AVX/CUDA blocks rather than running a loop like that.

PS: if you're using CUDA, the GPU's parallelism already amortizes much of the cost, so your efforts may have only a minor effect.