Hello,

I’m interested in the following problem: given a tensor A of size M containing my data, and a tensor B of size N containing ‘lengths’, I would like to apply a reduction operation over the first B[0] elements of A, then the next B[1] elements of A, and so on.

Example:

A: [3, 6, 5, 2, 6, 1, -3, 7]

B: [2, 4, 1, 1]

Using a mean reduction, I want my output tensor to return:

Output: [4.5, 3.5, -3, 7]

There are various ways I can do this.

A simple for loop works but is very slow, as expected.
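For reference, the loop baseline I have in mind looks roughly like this (a sketch; `segment_mean_loop` is just an illustrative name):

```python
import torch

def segment_mean_loop(A, B):
    # Split A into consecutive chunks whose lengths come from B,
    # then reduce each chunk independently. Clear, but launches one
    # tiny kernel per segment, so it's very slow on GPU.
    return torch.stack([chunk.mean() for chunk in torch.split(A, B.tolist())])

A = torch.tensor([3., 6., 5., 2., 6., 1., -3., 7.])
B = torch.tensor([2, 4, 1, 1])
print(segment_mean_loop(A, B))  # tensor([ 4.5000,  3.5000, -3.0000,  7.0000])
```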

Using torch.scatter is much faster, but it still feels like a potential speedup is being left on the table. In particular, scatter allows an arbitrary ordering of indices; it doesn’t take advantage of the fact that my tensor B defines a series of ordered, contiguous slices into A.
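Concretely, the scatter formulation I’m comparing against is something like the following (a sketch; `segment_mean_scatter` is my name for it):

```python
import torch

def segment_mean_scatter(A, B):
    # Give every element of A a segment id (0,0,1,1,1,1,2,3 for B=[2,4,1,1]),
    # scatter-add into one accumulator slot per segment, then divide
    # by the segment lengths to get the mean.
    idx = torch.repeat_interleave(torch.arange(B.numel(), device=A.device), B)
    sums = torch.zeros(B.numel(), dtype=A.dtype, device=A.device)
    sums.scatter_add_(0, idx, A)
    return sums / B.to(A.dtype)

A = torch.tensor([3., 6., 5., 2., 6., 1., -3., 7.])
B = torch.tensor([2, 4, 1, 1])
print(segment_mean_scatter(A, B))  # tensor([ 4.5000,  3.5000, -3.0000,  7.0000])
```

Note that the index tensor passed to scatter_add_ here is already sorted and contiguous, which is exactly the structure a dedicated segment-reduction kernel could exploit.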

To illustrate this, the reverse operation (i.e. broadcasting some tensor C by repeating each element as many times as denoted by tensor B) is much slower via fancy indexing - particularly on the backward pass - than via torch.repeat_interleave.
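To make that comparison concrete, the two formulations of the broadcast compute the same thing; only the kernel differs (a small sketch):

```python
import torch

B = torch.tensor([2, 4, 1, 1])
C = torch.tensor([10., 20., 30., 40.])

# Fancy indexing: materialise the index tensor, then gather. The backward
# pass of a gather is a scatter over arbitrary indices, which is slow.
idx = torch.repeat_interleave(torch.arange(C.numel()), B)
out_indexing = C[idx]

# Dedicated kernel: knows the output is built from contiguous runs,
# so both forward and backward can be much cheaper.
out_repeat = torch.repeat_interleave(C, B)

assert torch.equal(out_indexing, out_repeat)
```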

Note that I am performing all this on a GPU.

Does anyone have any ideas on how to make the above more performant? Does it require an explicit new kernel to be written, perhaps? I feel like this must be a useful feature to have in general, given that repeat_interleave exists.

Thanks!