Performant variable length reduction (inverse of repeat_interleave?)


I’m interested in the following problem: given a tensor of A of size M consisting of my data, and a tensor B of size N consisting of ‘lengths’, I would like to apply a reduction operation over the first B[0] elements of A, then the next B[1] elements of A, etc.


A: [3, 6, 5, 2, 6, 1, -3, 7]
B: [2, 4, 1, 1]

Using a mean reduction, I want my output tensor to return:

Output: [4.5, 3.5, -3, 7]

There are various ways I can do this.

A simple for loop works but is very slow, as expected.

Using torch.scatter is much faster, but still feels like it is missing a potential speedup. In particular, scatter allows arbitrary ordering of indices; it doesn’t take advantage of the fact that my tensor B is going to be creating a series of ordered slices into A.

To illustrate this, if I try the reverse operation (e.g. broadcasting some tensor C out by repeating elements as many times as denoted by tensor B) via fancy indexing, this is much slower - particularly on the backwards pass - than using torch.repeat_interleave.

Note that I am performing all this on a GPU.

Does anyone have any ideas on how to make the above more performant? Does it require an explicit new kernel to be written, perhaps? I feel like this must be a useful feature to have in general, given that repeat_interleave exists.


numpy.reduceat is a slightly more general version of what I would like to have; its equivalent doesn’t seem to exist in PyTorch.


I don’t think such function exists at the moment.
That being said we would be very happy to accept a PR if you want to add it :slight_smile:


Thanks for your response. We don’t know much about CUDA kernels but I think we’d be willing to read up about it some more; would you have some resources you can point us to? Alternatively, would you happen to know anyone who would be able to guide us in some way?

I think the PR for repeat_interleave that you can find here is a very good example. You would need to change exactely the same files to have the binding. Change the implementation for the cpp and cuda code and add a test.
Note that for the interface, I think an extra argument like reduce="sum" that can take value “sum” and “max” is good. Potentially add “mean” for convenience as well even though it can be done by dividing the result with the lenghts tensor.