Generating a 2D matrix with 1s at particular positions

I have a 1D tensor s with elements in ascending order. I want to generate a 2D tensor a of shape (len(s)-1, s[-1]) such that a[i, s[i]:s[i+1]] = 1 and all other elements are 0. How can I do this in a vectorized way, without using loops?

Example:

Input: s = torch.tensor([0,2,5,7])
Output:

a = torch.tensor([
    [1,1,0,0,0,0,0],
    [0,0,1,1,1,0,0],
    [0,0,0,0,0,1,1]
])

Loop implementation:

a = torch.zeros(len(s)-1,s[-1])
for i in range(len(a)):
    a[i, s[i]:s[i+1]] = 1

This is part of training a neural network, so I’m looking for a vectorized implementation.

In case it helps, this is actually part of a bigger problem where I am trying to sum unevenly-sized groups of rows of a 2D tensor.

As an example:

inp = torch.tensor([
    [0,1],
    [2,3],
    [4,5],
    [6,7],
    [8,9],
    [10,11],
    [12,13]
])
s = torch.tensor([0,2,5,7])

I want my output to have 3 rows: row 0 should sum rows 0:2 of inp, row 1 should sum rows 2:5, and row 2 should sum rows 5:7. So, the output should be:

out = torch.tensor([
    [2,4],
    [18,21],
    [22,24]
])

I realized I could do this if I can generate a as above, then out = a@inp.
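
For reference, here is a quick check of that relation on the example above, still using the loop construction of a for now (inp is cast to float so the matmul dtypes match):

import torch

s = torch.tensor([0,2,5,7])
inp = torch.arange(14).reshape(7, 2)   # same as the inp example above

# the loop implementation of a from above
a = torch.zeros(len(s)-1, s[-1])
for i in range(len(a)):
    a[i, s[i]:s[i+1]] = 1

out = a @ inp.float()   # cast inp so both matmul operands are float
print(out)
# tensor([[ 2.,  4.],
#         [18., 21.],
#         [22., 24.]])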

I know I can solve the problem using scatter_add, but I don’t know how to construct the index parameter of scatter_add from s without using loops.
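
For concreteness, here is a minimal sketch of what I mean by the scatter_add route, where the index tensor is still built with a loop (that loop is exactly the part I’d like to get rid of):

import torch

s = torch.tensor([0,2,5,7])
inp = torch.arange(14).reshape(7, 2)

# index[i] = which output row the i-th row of inp should be added to
index = torch.empty(inp.size(0), dtype=torch.long)
for i in range(len(s)-1):   # the loop I would like to avoid
    index[s[i]:s[i+1]] = i

out = torch.zeros(len(s)-1, inp.size(1), dtype=inp.dtype)
out.scatter_add_(0, index.unsqueeze(1).repeat(1, inp.size(1)), inp)
print(out)   # tensor([[ 2,  4], [18, 21], [22, 24]])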

I think you will have to loop over the splits.
But here’s how it can be done:
Convert your s tensor to a list giving the length of each split: [2-0, 5-2, 7-5] = [2, 3, 2].
Then use split, sum, stack as follows:
torch.stack([i.sum(0) for i in torch.split(inp, [2,3,2])])
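
If you don’t want to hard-code the lengths, they can be computed from s directly. A small sketch using the example tensors:

import torch

inp = torch.arange(14).reshape(7, 2)
s = torch.tensor([0,2,5,7])

lengths = (s[1:] - s[:-1]).tolist()   # [2, 3, 2]
out = torch.stack([chunk.sum(0) for chunk in torch.split(inp, lengths)])
print(out)   # tensor([[ 2,  4], [18, 21], [22, 24]])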

Thanks, yeah that’s definitely a good way of doing it.

But I’m still wondering if there’s a loop-free way of doing it, i.e. without a Python-level loop over the splits. Something like torch.sum(inp, splits=s) that would automatically sum over contiguous chunks of uneven sizes and collect the results together.

Not the prettiest solution, but it does away with loops.

import torch

s = torch.tensor([0,2,5,7])
w = s.size(0) - 1   # number of output rows

# After the difference below, each row of sp holds
# [number of leading zeros, number of ones, number of trailing zeros]
sp = torch.stack([torch.zeros(w), s[:-1], s[1:], torch.full((w,), s.max().item())]).T
sp = sp[:,1:] - sp[:,:-1]
sp = sp.int()

# Build each row of z by concatenating zeros / ones / zeros of those lengths
z = torch.stack([torch.cat([torch.zeros(sp[i][0].item()),
                            torch.ones(sp[i][1].item()),
                            torch.zeros(sp[i][2].item())])
                 for i in range(sp.size(0))])
print(z)

EDIT: That should have been a zero in range(sp.size()[0]). It should work now, please try again.

Here is a bigger test on the above method:

import torch

s=torch.arange(10000)*2
print(s)

tensor([    0,     2,     4,  ..., 19994, 19996, 19998])

So the final size should be 19998 on dim 1.

w=s.size()[0]-1
sp=torch.stack([torch.zeros(w),s[:-1],s[1:], torch.full((w,), s.max().item())]).T
sp = sp[:,1:]-sp[:,:-1]
sp = sp.int()
print(sp)

tensor([[    0,     2, 19996],
        [    2,     2, 19994],
        [    4,     2, 19992],
        ...,
        [19992,     2,     4],
        [19994,     2,     2],
        [19996,     2,     0]], dtype=torch.int32)

z=torch.stack([torch.cat([torch.zeros(sp[i][0].item()), torch.ones(sp[i][1].item()), torch.zeros(sp[i][2].item()) ]) for i in range(sp.size()[0])])
print(z)

tensor([[1., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 1., 1.]])

print(z.size())

torch.Size([9999, 19998])

Total run time was under 1 second on CPU.

Thank you, that’s a clever solution! I really appreciate you taking the time to write a test case.

I noticed that it still uses a loop, i.e. for i in range(sp.size()[0]). I did a time comparison of 1) @rinkujadhav2013 's solution vs 2) this one, and 1) is a bit faster.

In conclusion, it seems like loops are unavoidable for this problem :frowning:

Appreciate both of your efforts!

I think loops might be unavoidable because your splits are of uneven size. When doing matrix multiplications, dimensions need to align across the things being parallelised.

Well, torch_scatter is built to handle splits of uneven size vectorially, provided you can supply the correct arguments to it. In my case, constructing the index argument to scatter_add needed a loop, which is why I was looking at alternatives.

Torch scatter and gather use loops as far as I can tell.
Check the CUDA kernel here: pytorch/ScatterGatherKernel.cu at d7847ed23eb8880c7541319aab8864a45e239994 · pytorch/pytorch · GitHub
or the CPU kernels here:
pytorch/ScatterGatherKernel.cpp at 8f5f15a64b7f2efabfc9137ddc1ffb3390375626 · pytorch/pytorch · GitHub

Oh, that’s interesting. Probably explains why a torch scatter solution I came up with (with loops to construct the index matrix) took longer than your suggested torch.stack solution.

@souryadey when a for loop is contained within square brackets (e.g. [k for k in x]), it becomes a special asynchronous iterator in Python. Asynchronous means each step of k computes separately and doesn’t wait for other iterations of k to compute first. So it’s effectively not a “for loop”, though it still uses for.

My understanding is that something like [k for k in x] is a list comprehension, which may be faster than a regular loop in certain settings, but it doesn’t give the performance of vector processing. This article has a good discussion.
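
As a rough way to check the loop-vs-comprehension part, one could time the two forms (a small sketch, not a rigorous benchmark):

import timeit

x = list(range(10_000))

def plain_loop():
    out = []
    for k in x:
        out.append(k * 2)
    return out

def comprehension():
    return [k * 2 for k in x]

print("loop:          ", timeit.timeit(plain_loop, number=1_000))
print("comprehension: ", timeit.timeit(comprehension, number=1_000))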