Count number of occurrences of each value per row

Hi,
I haven't found exactly this problem before, and I'm keen to hear if anyone has any ideas. I have a 2D matrix x containing integer values. I want to count how many times each value occurs per row and return a new 2D matrix num_of_value that has the same number of rows, but whose columns represent the different values. So num_of_value[0, 1] is the number of occurrences of 1 in row 0, and so on. My best solution so far is a loop, which might be inefficient for a large number of unique values.

Any ideas?

#%%
x =  torch.tensor([[1, 2, 3, 4, 2, 5, 6, 7, 8, 9, 7, 0, 0, 6, 9, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 1, 4, 1, 1, 2, 3, 1]])
# %%
# Loop through all values:

num_of_value = torch.zeros((len(x), int(x.max()) + 1), device=x.device)
for k in range(int(x.max()) + 1):
    # Count how many entries in each row equal k
    num_k = (x == k).sum(dim=1)
    num_of_value[:, k] = num_k

print(num_of_value)
tensor([[ 5.,  2.,  2.,  1.,  1.,  1.,  2.,  2.,  1.,  2.],
        [ 0., 13.,  2.,  3.,  1.,  0.,  0.,  0.,  0.,  0.]])

Hi Simen!

Please try torch.nn.functional.one_hot (x).sum (dim = 1).
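For example, with the x from your post, a quick sketch of the idea:

import torch
import torch.nn.functional as F

x = torch.tensor([[1, 2, 3, 4, 2, 5, 6, 7, 8, 9, 7, 0, 0, 6, 9, 0, 1, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 1, 4, 1, 1, 2, 3, 1]])

# one_hot (x) has shape [rows, columns, num_classes]; summing over the
# columns dimension gives the per-row count of each value.
num_of_value = F.one_hot(x).sum(dim=1)
print(num_of_value)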

Best.

K. Frank


Wow, yes, of course! Thank you so much!

Is there a way to do this without materializing a one-hot matrix? It is very memory-inefficient for a large number of classes k.

Hi Lyprince!

Yes. You can use torch.bincount() if you add a row-label “offset” value to your
input tensor and then flatten it.

Consider:

>>> import torch
>>> print (torch.__version__)
2.0.1
>>>
>>> x = torch.tensor([[1, 2, 3, 4, 2, 5, 6, 7, 8, 9, 7, 0, 0, 6, 9, 0, 1, 0, 0],
...                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 1, 4, 1, 1, 2, 3, 1]])
>>>
>>> r = torch.tensor([[ 5.,  2.,  2.,  1.,  1.,  1.,  2.,  2.,  1.,  2.],
...                   [ 0., 13.,  2.,  3.,  1.,  0.,  0.,  0.,  0.,  0.]])
>>>
>>> xlim = x.max() + 1
>>> xrow = x.size (0)
>>> minl = xlim * xrow
>>>
>>> xlab = x + xlim * torch.arange (xrow).unsqueeze (1)
>>> xtmp = torch.bincount (xlab.flatten(), minlength = minl)
>>> xcnt = xtmp.reshape (xrow, xlim)
>>>
>>> torch.equal (xcnt, r.long())
True

You still have to materialize the final result of shape [rows, xlim], which can
be large if xlim is large, but you don’t have to materialize the full one-hot
tensor of shape [rows, columns, xlim], which is a factor of columns larger.
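As a rough illustration with the small example above, the intermediate one-hot
tensor would have shape [2, 19, 10] (that is, torch.nn.functional.one_hot (x).shape),
while the final counts xcnt have shape [2, 10], so the intermediate is a factor
of columns = 19 larger.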

Best.

K. Frank


Thanks @KFrank

I also tried using torch.vmap, which is faster than one_hot but slower than flattening+offsetting.

With version 2.0.1 I get the following warning:

UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::bincount. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at ../aten/src/ATen/functorch/BatchedFallback.cpp:82.)
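For reference, what I tried looks roughly like this (just a sketch; the name per_row_bincount is mine, and minlength has to be fixed so every row produces a vector of the same length, which vmap requires):

import torch

x = torch.tensor([[1, 2, 3, 4, 2, 5, 6, 7, 8, 9, 7, 0, 0, 6, 9, 0, 1, 0, 0],
                  [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 1, 4, 1, 1, 2, 3, 1]])

num_classes = int(x.max()) + 1

# vmap maps bincount over rows (dim 0); a fixed minlength keeps the
# per-row output length identical for every row.
per_row_bincount = torch.vmap(lambda row: torch.bincount(row, minlength=num_classes))
num_of_value = per_row_bincount(x)
print(num_of_value)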