Strided sum without having to loop

Hey there, is there an efficient way to do a strided sum in PyTorch? Particularly when the number of elements that fall under each stride is variable but specified. For example:

a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
stride = torch.tensor([3, 3, 4])
result = torch.strided_sum(a, stride)  # something like this

Meaning, I want to sum the first 3 elements, the next 3, and the last 4. The resulting tensor would be:

result = torch.tensor([(1 + 2 + 3), (4 + 5 + 6), (7 + 8 + 9 + 10)]) = torch.tensor([6, 15, 34])

I would like to do this without a Python loop, as that would make the function too slow. Is this possible?

You can use scatter_add_ to accumulate these values to a new tensor.
Assuming you have already calculated the stride tensor, you would need to create the index tensor from it as seen here:

a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
stride = torch.tensor([3, 3, 4])
# Build the index tensor [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]:
# segment i is repeated stride[i] times
idx = torch.tensor(sum([[i]*s for i, s in zip(range(stride.size(0)), stride)], []))

# Accumulate the values of a into one output slot per segment
out = torch.zeros(stride.size(0), dtype=a.dtype).scatter_add_(0, idx, a)
> tensor([ 6, 15, 34])
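Note that the `sum([[i]*s ...], [])` comprehension above still iterates in Python to build the index. If that matters for large inputs, `torch.repeat_interleave` should produce the same index tensor in a single call (a sketch of the same computation; not from the original post):

```python
import torch

a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
stride = torch.tensor([3, 3, 4])

# Repeat each segment id stride[i] times: [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
idx = torch.repeat_interleave(torch.arange(stride.size(0)), stride)

out = torch.zeros(stride.size(0), dtype=a.dtype).scatter_add_(0, idx, a)
print(out)  # tensor([ 6, 15, 34])
```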

@ptrblck Is there a way to do this for other reduction operations? For instance if I wanted to apply a scatter_max or scatter_softmax?

You could take a look at the scatter methods from rusty1s/pytorch_scatter.
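As an aside: if you are on a recent PyTorch release (1.12 or later, if I recall correctly), the built-in `Tensor.scatter_reduce_` also covers several of these reductions ("sum", "prod", "mean", "amax", "amin") without an extra dependency. A minimal sketch of a segmented max, reusing the index tensor from above:

```python
import torch

a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
stride = torch.tensor([3, 3, 4])
idx = torch.repeat_interleave(torch.arange(stride.size(0)), stride)

# Segmented max via the built-in scatter_reduce_; include_self=False
# makes the reduction ignore the initial zeros in the output tensor.
out = torch.zeros(stride.size(0), dtype=a.dtype)
out.scatter_reduce_(0, idx, a, reduce="amax", include_self=False)
print(out)  # tensor([ 3,  6, 10])
```

There is no built-in scatter softmax, so for that pytorch_scatter is still the more direct route.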

I’m running into a strange error with CUDA and PyTorch when I try to use torch_scatter.

RuntimeError: nvrtc: error: failed to open
  Make sure that is installed correctly.
nvrtc compilation failed:

Would you happen to know why this is happening? I’m running PyTorch 1.8.1 + cu111.

This seems to be a known issue with the CUDA 11.1 pip wheels, as described here, so you might need to use the conda binaries, a source build, or the CUDA 10.2 pip wheels.