# Efficient summation over distance tensor blocks

I have an N x N tensor, D (a symmetric distance matrix). The rows and columns of D are arranged in contiguous 'groups'. The grouping index for each row/col of D is given as a 1D tensor, B, where B[i] is the group (an int) assigned to row i of D. I would like to efficiently take the sum over every pair of blocks of D induced by the grouping B. For example, with a distance matrix between 5 objects in two groups:

```python
D = torch.tensor([[0., 1., 2., 3., 4.],
                  [1., 0., 1., 2., 3.],
                  [2., 1., 0., 1., 2.],
                  [3., 2., 1., 0., 1.],
                  [4., 3., 2., 1., 0.]])

# two groups, one of size 2 and one of size 3
B = torch.tensor([0, 0, 1, 1, 1])

# along each dim, the first group contains indices 0 and 1
# and the second group has indices 2, 3, 4
# this makes 3 unique blocks (D is symmetric):
# D[0:2, :][:, 0:2], D[0:2, :][:, 2:5], D[2:5, :][:, 2:5]

>>> result
tensor([ 2., 15.,  8.])
```

The most efficient approach would seem to be `scatter_add_`, but I would appreciate an example of how to apply it here, or other guidance.

So scatter_add (and similar operations), when implemented with `atomicAdd`, get slow when you have many "collisions", i.e. GPU threads trying to write to the same address.
When you have contiguous blocks like these, it is feasible to write a kernel that exploits that structure to be much more efficient, but the effort is likely quite high.
Personally, I'd probably double-check the performance of `scatter_add` vs. calling `.sum()` on the blocks.
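For illustration, one way to apply `scatter_add_` here is to map every entry (i, j) of D to a flat block index `B[i] * G + B[j]` and accumulate the flattened D into a G x G buffer of block sums. A minimal sketch (the variable names are my own, and this assumes group labels are consecutive integers starting at 0):

```python
import torch

D = torch.tensor([[0., 1., 2., 3., 4.],
                  [1., 0., 1., 2., 3.],
                  [2., 1., 0., 1., 2.],
                  [3., 2., 1., 0., 1.],
                  [4., 3., 2., 1., 0.]])
B = torch.tensor([0, 0, 1, 1, 1])

G = int(B.max()) + 1                              # number of groups
# flat block index for every entry (i, j): B[i] * G + B[j]
idx = (B.unsqueeze(1) * G + B.unsqueeze(0)).reshape(-1)
# accumulate each entry of D into its block's slot
sums = torch.zeros(G * G, dtype=D.dtype)
sums.scatter_add_(0, idx, D.reshape(-1))
block_sums = sums.reshape(G, G)                   # block_sums[a, b] = D[B==a][:, B==b].sum()
# keep only the unique blocks (upper triangle, since D is symmetric)
r, c = torch.triu_indices(G, G)
result = block_sums[r, c]
# result: tensor([ 2., 15.,  8.])
```

Since `idx` repeats each block index for every element of that block, this is exactly the collision-heavy pattern described above, so it is worth benchmarking on your actual sizes.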

Best regards

Thomas

Thanks for this, I'll definitely do some benchmarking. I would also appreciate an example of how to use scatter_add in this situation, since I am quite new to this function…
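In the meantime, here is the per-block `.sum()` baseline I plan to benchmark against (a straightforward sketch of my own, looping over unique group pairs; only the upper triangle is taken since D is symmetric):

```python
import torch

D = torch.tensor([[0., 1., 2., 3., 4.],
                  [1., 0., 1., 2., 3.],
                  [2., 1., 0., 1., 2.],
                  [3., 2., 1., 0., 1.],
                  [4., 3., 2., 1., 0.]])
B = torch.tensor([0, 0, 1, 1, 1])

groups = torch.unique(B)                  # sorted unique group labels
result = []
for a_i, a in enumerate(groups):
    for b in groups[a_i:]:                # upper triangle only: b >= a
        # select the rows of group a and the columns of group b
        block = D[B == a][:, B == b]
        result.append(block.sum())
result = torch.stack(result)
# result: tensor([ 2., 15.,  8.])
```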