Calculate kernel by blocks

Hi all, I currently have a tensor of size BxNxD (batch size x number of samples per block x dimension).
Could anyone help me, please? I want to speed up the following computation:

K = Kernel() # similar with
total = 0  # avoid shadowing the built-in sum()
for i in range(B):
    total += K(tensor[i], tensor[i])

It calculates the kernel of each NxD tensor (which returns an NxN tensor) and then sums the results over the minibatch.
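For reference, here is a minimal sketch of the loop above next to a batched version that handles all B blocks in one call. The kernel is not specified in this thread, so `rbf_kernel` (a Gaussian kernel with a made-up `gamma` parameter) is only an assumed stand-in, and the function names are hypothetical. The batched form only works if the kernel can be written with operations that broadcast over a leading batch dimension.

```python
import torch


def rbf_kernel(x, y, gamma=1.0):
    """Assumed stand-in for K: Gaussian (RBF) kernel.

    x: (..., N, D), y: (..., M, D) -> (..., N, M)
    """
    x_sq = (x ** 2).sum(dim=-1, keepdim=True)        # (..., N, 1)
    y_sq = (y ** 2).sum(dim=-1).unsqueeze(-2)        # (..., 1, M)
    cross = x @ y.transpose(-1, -2)                  # (..., N, M)
    # Squared Euclidean distances; clamp removes tiny negative rounding errors.
    sq_dist = (x_sq - 2.0 * cross + y_sq).clamp(min=0.0)
    return torch.exp(-gamma * sq_dist)


def sum_kernels_loop(tensor):
    """Reference version: one kernel call per block, as in the question."""
    n = tensor.shape[1]
    total = tensor.new_zeros(n, n)
    for i in range(tensor.shape[0]):
        total += rbf_kernel(tensor[i], tensor[i])
    return total


def sum_kernels_batched(tensor):
    """Same result with a single batched call: (B, N, N) summed over B."""
    return rbf_kernel(tensor, tensor).sum(dim=0)
```

Both functions should return the same NxN tensor up to floating-point rounding. Note that the batched version materializes a BxNxN intermediate, so memory use grows with B.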

@leo2k This sounds more like a process bottleneck. I recommend using a multiprocessing pool to speed it up.

If you are up for an adventure, you can use the CUDA cores to do this calculation. This will be blazing fast.
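A rough sketch of what running this on the GPU could look like in PyTorch. Since the actual kernel is not given in this thread, the example assumes a plain linear kernel K(x, y) = x y^T as a placeholder, and `summed_linear_kernel` is a hypothetical name. It falls back to the CPU when no CUDA device is available.

```python
import torch


def summed_linear_kernel(tensor, device=None):
    """Sum of per-block Gram matrices for an assumed linear kernel.

    tensor: (B, N, D) -> (N, N), computed on `device`.
    """
    if device is None:
        # Use the GPU when one is present, otherwise stay on the CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    x = tensor.to(device)
    # einsum contracts over both the batch index b and the feature index d,
    # so the (B, N, N) intermediate is never stored explicitly.
    return torch.einsum("bnd,bmd->nm", x, x)
```

Whether this is actually faster depends on B, N and D and on the cost of copying the data to the device. When timing it, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels are launched asynchronously.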

Can anyone help me implement this, please?
