I’m interested in the following problem: given a tensor A of size M containing my data, and a tensor B of size N containing ‘lengths’, I would like to apply a reduction operation over the first B[0] elements of A, then the next B[1] elements, and so on. For example:
A: [3, 6, 5, 2, 6, 1, -3, 7]
B: [2, 4, 1, 1]
Using a mean reduction, I want my output tensor to return:
Output: [4.5, 3.5, -3, 7]
There are various ways I can do this.
A simple for loop works but is very slow, as expected.
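For reference, the loop baseline I'm describing looks roughly like this (function name is just mine for illustration):

```python
import torch

def segment_mean_loop(A, B):
    # Naive baseline: one slice + reduction per segment. Slow on GPU
    # because it launches a separate kernel for every segment.
    out = torch.empty(B.numel(), dtype=A.dtype, device=A.device)
    start = 0
    for i, n in enumerate(B.tolist()):
        out[i] = A[start:start + n].mean()
        start += n
    return out

A = torch.tensor([3., 6., 5., 2., 6., 1., -3., 7.])
B = torch.tensor([2, 4, 1, 1])
print(segment_mean_loop(A, B).tolist())  # [4.5, 3.5, -3.0, 7.0]
```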
Using torch.scatter_add is much faster, but still feels like it is missing a potential speedup. In particular, scatter allows arbitrary ordering of indices; it doesn’t take advantage of the fact that B defines a series of ordered, contiguous slices into A.
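Concretely, the scatter-based version I'm using is along these lines: build a segment id for every element of A, scatter-add into one bucket per segment, then divide by the lengths:

```python
import torch

def segment_mean_scatter(A, B):
    # One segment id per element of A, e.g. B = [2, 4, 1, 1]
    # gives idx = [0, 0, 1, 1, 1, 1, 2, 3].
    idx = torch.repeat_interleave(torch.arange(B.numel(), device=A.device), B)
    sums = torch.zeros(B.numel(), dtype=A.dtype, device=A.device)
    sums.scatter_add_(0, idx, A)  # per-segment sums
    return sums / B.to(A.dtype)  # sums -> means

A = torch.tensor([3., 6., 5., 2., 6., 1., -3., 7.])
B = torch.tensor([2, 4, 1, 1])
print(segment_mean_scatter(A, B).tolist())  # [4.5, 3.5, -3.0, 7.0]
```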
To illustrate this, the reverse operation (i.e. broadcasting some tensor C out by repeating each element as many times as denoted by B) is much slower via fancy indexing, particularly on the backward pass, than via torch.repeat_interleave.
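To make that comparison concrete, the two ways of writing the broadcast look like this; both produce the same result, only the speed differs:

```python
import torch

B = torch.tensor([2, 4, 1, 1])
C = torch.tensor([4.5, 3.5, -3., 7.])

# Fast path: a single fused kernel.
fast = torch.repeat_interleave(C, B)

# Slow path: materialise an index tensor, then gather. The backward of
# the gather is itself a scatter-add, which is where the extra cost shows up.
idx = torch.repeat_interleave(torch.arange(C.numel()), B)
slow = C[idx]

assert torch.equal(fast, slow)
print(fast.tolist())  # [4.5, 4.5, 3.5, 3.5, 3.5, 3.5, -3.0, 7.0]
```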
Note that I am performing all this on a GPU.
Does anyone have any ideas on how to make the above more performant? Does it require an explicit new kernel to be written, perhaps? I feel like this must be a useful feature to have in general, given that repeat_interleave exists.
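One workaround I've considered that does exploit the ordered, contiguous structure: compute a prefix sum and take differences at the segment boundaries. This is only a sketch, and it only works for sum/mean-style reductions (not max/min); floating-point error can also accumulate in the cumsum for long tensors:

```python
import torch

def segment_mean_cumsum(A, B):
    # sum of A[start:end] == csum[end] - csum[start], so every segment
    # sum costs one cumulative-sum pass plus two gathers.
    csum = torch.cat([A.new_zeros(1), A.cumsum(0)])
    ends = B.cumsum(0)      # exclusive end of each segment
    starts = ends - B       # start of each segment
    sums = csum[ends] - csum[starts]
    return sums / B.to(A.dtype)

A = torch.tensor([3., 6., 5., 2., 6., 1., -3., 7.])
B = torch.tensor([2, 4, 1, 1])
print(segment_mean_cumsum(A, B).tolist())  # [4.5, 3.5, -3.0, 7.0]
```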