Calculate kernel by blocks

Hi all, I currently have a tensor of shape BxNxD (batch size x number of samples per block x dimension).
Could anyone give me a hand, please? I want to speed up the following computation:

K = Kernel()  # similar to https://github.com/activatedgeek/svgd/blob/master/rbf.py
kernel_sum = 0
for i in range(B):
    kernel_sum += K(tensor[i], tensor[i])  # (N, N) kernel for block i

It computes the kernel of each NxD block (which returns an NxN tensor) and then sums over the minibatch.

@leo2k This sounds more like a processing bottleneck. I'd recommend using a multiprocessing pool to speed it up:

https://docs.python.org/3/library/multiprocessing.html
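A rough sketch of what that could look like is below. It assumes a CPU tensor and uses a fixed-bandwidth RBF function as a stand-in for your Kernel class (the linked rbf.py uses a median heuristic, so adapt the bandwidth logic to your actual kernel). Each block gets pickled and sent to a worker, so for small N the process overhead may outweigh the gain.

import torch
from multiprocessing import Pool

def rbf_kernel(x, y, sigma=1.0):
    # Stand-in for Kernel(); fixed-bandwidth RBF, returns an (N, N) matrix.
    sq_dist = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))

def per_block(block):
    # Each worker handles one (N, D) block.
    return rbf_kernel(block, block)

if __name__ == "__main__":
    B, N, D = 8, 64, 3                    # made-up sizes for illustration
    tensor = torch.randn(B, N, D)         # replace with your real (B, N, D) tensor
    with Pool() as pool:
        results = pool.map(per_block, [tensor[i] for i in range(B)])
    kernel_sum = torch.stack(results).sum(dim=0)  # (N, N), same as the loop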

If you are up for an adventure, you can use the CUDA cores to do this calculation. That will be blazing fast.
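If the data fits in GPU memory, you don't even need the Python loop: torch.cdist accepts batched (B, N, D) inputs, so the whole sum can be done in one shot on the GPU. A minimal sketch, again assuming a fixed-bandwidth RBF rather than the median-heuristic kernel in rbf.py:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tensor = torch.randn(8, 64, 3)            # made-up (B, N, D) example; use your real tensor
x = tensor.to(device)
sq_dist = torch.cdist(x, x) ** 2          # (B, N, N) pairwise squared distances per block
k = torch.exp(-sq_dist / (2 * 1.0 ** 2))  # fixed-bandwidth RBF; sigma=1.0 is an assumption
kernel_sum = k.sum(dim=0)                 # (N, N) sum over the minibatch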

Can anyone help me implement this, please?

[Bump up]
Can anyone help me implement this, please?