Row-wise comparisons between 2D-tensors

dirkDB · February 16, 2021, 3:03pm

Hi everyone!

I’m trying to compare the elements of two 2D tensors row by row. A simple example would be the following two tensors:

a = torch.tensor([[1,2], [4,5], [7,8]])
b = torch.tensor([[2,3], [7,5], [-1,7]])

Now I’d like to check for each element in the first tensor if it is part of the same row in the second tensor. My expected result would be

[
[False, False] (1 vs [2, 3])
[True, False] (2 vs [2, 3])
[False, False] (4 vs [7, 5])
[False, True] (5 vs [7, 5])
[False, True] (7 vs [-1, 7])
[False, False] (8 vs [-1, 7])
]

Does anyone have any idea how to solve this efficiently?

Thanks a lot!

ptrblck · February 17, 2021, 7:34am

This certainly isn’t efficient memory-wise, but you could check whether it yields a speedup compute-wise:

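import torch

a = torch.tensor([[1,2], [4,5], [7,8]])
b = torch.tensor([[2,3], [7,5], [-1,7]])

# Compare every element of a with every element of b: shape [6, 3, 2]
ret = a.view(-1, 1, 1) == b
# Row index of b that belongs to each flattened element of a: [0, 0, 1, 1, 2, 2]
idx = torch.arange(3).unsqueeze(1).expand(-1, 2).reshape(-1)
# Keep only the comparison against the matching row of b: shape [6, 2]
print(ret[torch.arange(ret.size(0)), idx])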
> tensor([[False, False],
          [ True, False],
          [False, False],
          [False,  True],
          [False,  True],
          [False, False]])

dirkDB · February 17, 2021, 2:36pm

That is very helpful, thank you very much!

Eta_C · February 18, 2021, 7:07am

Another implementation:

res = a.repeat_interleave(2, dim=1).reshape(-1, 2) == b.repeat_interleave(2, dim=0)

dirkDB · February 18, 2021, 12:00pm

Thanks a lot for this, cool to see that there are so many possibilities to solve this problem!

Eta_C’s solution seems to be considerably faster for large tensors (shape [10000, 2]):

import torch
from timeit import default_timer as timer

N = 10000
a = torch.rand([N, 2])
b = torch.rand([N, 2])

# Approach 1 (ptrblck): broadcast to [2N, N, 2], then select the matching row
start = timer()
idx = torch.arange(N).unsqueeze(1).expand(-1, 2).reshape(-1)
for _ in range(500):
    ret = a.view(-1, 1, 1) == b
    res = ret[torch.arange(ret.size(0)), idx]
end = timer()
print(end - start)

# Approach 2 (Eta_C): expand both tensors to [2N, 2] with repeat_interleave
start2 = timer()
for _ in range(500):
    res = a.repeat_interleave(2, dim=1).reshape(-1, 2) == b.repeat_interleave(2, dim=0)
end2 = timer()
print(end2 - start2)

121.2189056
0.11425909999999817
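
A likely reason for the gap: the first approach materializes a boolean tensor of shape [2N, N, 2] (about 4e8 elements for N = 10000) before indexing into it, while the second only ever builds [2N, 2] tensors. As a further sketch (not taken from the posts above), broadcasting inside each row pair should give the same result with an intermediate of only [N, 2, 2]:

```python
import torch

a = torch.tensor([[1, 2], [4, 5], [7, 8]])
b = torch.tensor([[2, 3], [7, 5], [-1, 7]])

# a.unsqueeze(2) has shape [N, 2, 1], b.unsqueeze(1) has shape [N, 1, 2].
# Broadcasting gives [N, 2, 2], where entry [i, j, k] is a[i, j] == b[i, k].
res = (a.unsqueeze(2) == b.unsqueeze(1)).reshape(-1, b.size(1))
print(res)
# tensor([[False, False],
#         [ True, False],
#         [False, False],
#         [False,  True],
#         [False,  True],
#         [False, False]])
```

Timings will depend on hardware and tensor sizes, so it is worth benchmarking this variant against the two above rather than assuming it is faster.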

nastaranmarzban · December 19, 2021, 2:02pm

Hi, hope you’re doing well.
I have 2 tensors with different numbers of rows:

a = torch.tensor([[8,2], [5,3],[4,4]])
b = torch.tensor([[1,2],[5,3]])

I want a boolean tensor that indicates, for each row of a, whether it exists in b, without iterating. Something like
a in b
which should give
[False, True, False]
Could you please help me?
Thanks in advance.

ptrblck · December 19, 2021, 8:37pm

This should work:

import torch

a = torch.tensor([[8,2], [5,3],[4,4]])
b = torch.tensor([[1,2],[5,3]])

res = (a.unsqueeze(0) == b.unsqueeze(1)).all(dim=2).any(dim=0)
print(res)
# > tensor([False,  True, False])

The all(dim=2) operation makes sure that all elements of a row match, while the any(dim=0) operation checks whether any row of b matches the corresponding row in a.
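
To make the shapes explicit, here is the same expression split into steps. The intermediate names eq and row_match are only illustrative:

```python
import torch

a = torch.tensor([[8, 2], [5, 3], [4, 4]])  # shape [3, 2]
b = torch.tensor([[1, 2], [5, 3]])          # shape [2, 2]

# [1, 3, 2] == [2, 1, 2] broadcasts to [2, 3, 2]:
# eq[j, i, k] is True where b[j, k] == a[i, k]
eq = a.unsqueeze(0) == b.unsqueeze(1)

# Collapse the element dimension: row_match[j, i] is True if b[j] equals a[i]
row_match = eq.all(dim=2)       # shape [2, 3]

# Collapse the b dimension: True if any row of b equals a[i]
res = row_match.any(dim=0)      # shape [3]
print(res)
# tensor([False,  True, False])

# Reducing over the other dimension answers the reverse question:
# which rows of b appear in a?
print(row_match.any(dim=1))
# tensor([False,  True])
```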

nastaranmarzban · December 20, 2021, 5:39am

Thanks a lot…it’s very helpful.

nsacco · December 23, 2021, 2:12pm

Hi, I was looking for the same thing and came up with a similar solution. However, could this approach cause huge memory consumption if the tensors involved are large? If so, is there another solution that uses less memory and does not require loops? Thanks!

ptrblck · December 23, 2021, 8:52pm

Yes, the memory usage could be large since you are broadcasting the tensors and need to calculate the intermediates. Using loops would have a lower memory footprint, but could be slower. Your best bet might be to write a custom C++/CUDA operation for your use case and check if you could get a proper speedup without a large memory requirement.
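
As a middle ground between full broadcasting and a per-row loop, one option is to loop over chunks of a, so that only a [chunk_size, M, D] intermediate exists at any time. This is a minimal sketch; rows_in_chunked and chunk_size are hypothetical names, and the best chunk size would have to be found by measurement:

```python
import torch

def rows_in_chunked(a, b, chunk_size=1024):
    """For each row of a, report whether it equals some row of b."""
    out = []
    for start in range(0, a.size(0), chunk_size):
        chunk = a[start:start + chunk_size]        # [c, D]
        eq = chunk.unsqueeze(1) == b.unsqueeze(0)  # [c, M, D]
        out.append(eq.all(dim=2).any(dim=1))       # [c]
    if not out:  # a has no rows
        return torch.zeros(0, dtype=torch.bool, device=a.device)
    return torch.cat(out)

a = torch.tensor([[8, 2], [5, 3], [4, 4]])
b = torch.tensor([[1, 2], [5, 3]])
print(rows_in_chunked(a, b, chunk_size=2))
# tensor([False,  True, False])
```

This still uses a Python loop, but over chunks rather than individual rows, so the loop overhead shrinks as chunk_size grows while peak memory grows with it.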

nastaranmarzban · January 1, 2022, 5:51pm

Hi, happy new year! I wish you a happy and healthy year.
I have a question; could you please help me?
I have 2 tensors:
tensor([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,
7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 13,
13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,
21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33])
and
tensor([ 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2,
3, 7, 13, 17, 19, 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0,
1, 2, 7, 12, 13, 0, 6, 10, 0, 6, 10, 16, 0, 4, 5, 16, 0, 1,
2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, 0, 1, 2,
3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0,
1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23,
24, 33, 2, 31, 33, 23, 26, 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32,
33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33, 8, 9, 13, 14, 15,
18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32])
I also have one more tensor named “a” with a size of 34×34.
I want to access some, but not all, elements of “a” based on the two previous tensors.
For example, I need a[0][1], a[0][2], a[0][3], a[0][4], a[0][5], a[0][6], a[0][7], a[0][8], but I don’t need a[0][9] because 9 is not paired with 0 in the second tensor. Likewise, I need a[1][2], a[1][3], a[1][7], but I don’t need a[1][4] because 4 is not paired with 1 in the second tensor.
Thanks in advance.

ptrblck · January 1, 2022, 10:53pm

Direct indexing should work:

import torch

x = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
                  1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
                  3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,
                  7, 7, 8, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 13, 13, 13,
                  13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, 20, 20, 21,
                  21, 22, 22, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 27, 27,
                  27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31,
                  31, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 33,
                  33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33])

y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 19, 21, 31, 0, 2,
                  3, 7, 13, 17, 19, 21, 30, 0, 1, 3, 7, 8, 9, 13, 27, 28, 32, 0,
                  1, 2, 7, 12, 13, 0, 6, 10, 0, 6, 10, 16, 0, 4, 5, 16, 0, 1,
                  2, 3, 0, 2, 30, 32, 33, 2, 33, 0, 4, 5, 0, 0, 3, 0, 1, 2,
                  3, 33, 32, 33, 32, 33, 5, 6, 0, 1, 32, 33, 0, 1, 33, 32, 33, 0,
                  1, 32, 33, 25, 27, 29, 32, 33, 25, 27, 31, 23, 24, 31, 29, 33, 2, 23,
                  24, 33, 2, 31, 33, 23, 26, 32, 33, 1, 8, 32, 33, 0, 24, 25, 28, 32,
                  33, 2, 8, 14, 15, 18, 20, 22, 23, 29, 30, 31, 33, 8, 9, 13, 14, 15,
                  18, 19, 20, 22, 23, 26, 27, 28, 29, 30, 31, 32])

a = torch.randn(34, 34)
ret = a[x, y]

reference = []
for x_, y_ in zip(x, y):
    reference.append(a[x_, y_])
reference = torch.stack(reference)

print((ret == reference).all())
# > tensor(True)

nastaranmarzban · January 4, 2022, 4:58am

Hi, Thank you…it’s very helpful

nastaranmarzban · March 6, 2022, 3:20pm

Hi, hope you’re doing well…
I have a dataset and split it into train_mask and test_mask:

from sklearn.model_selection import train_test_split
train_mask, test_mask= train_test_split(x, test_size=0.33, random_state = 0, shuffle = True)

Then I used

train_mask = (x.unsqueeze(0) == train_mask.unsqueeze(1)).all(dim=2).any(dim=0)
test_mask = (x.unsqueeze(0) == test_mask.unsqueeze(1)).all(dim=2).any(dim=0)

to make them usable for PyG. After splitting, train_mask and test_mask have sizes torch.Size([33, 200]) and torch.Size([17, 200]), but after running the code above I get 47 and 35 True values for train_mask and test_mask.
Where am I making a mistake? Could you please help me?
x is a tensor of torch.Size([50, 200]). I’ve saved it as a .pt file, but because of upload limitations I cannot share it here.
Thanks in advance.

ptrblck · March 7, 2022, 7:08am

Does your dataset contain duplicates?
If so, it would be expected that your check yields more True values in the masks than there are rows in each split, since both splits can contain the duplicated rows.
You should be able to check it via x.sum(dim=1).unique().size().
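
Note that the row sum is only a quick heuristic, since different rows can share the same sum. A stricter check, sketched here on made-up data, compares whole rows with torch.unique(x, dim=0):

```python
import torch

# Made-up data: row 0 and row 2 are identical, row 3 merely has the same sum
x = torch.tensor([[1., 2.], [3., 4.], [1., 2.], [0., 3.]])

unique_rows, counts = torch.unique(x, dim=0, return_counts=True)
num_duplicates = x.size(0) - unique_rows.size(0)
print(num_duplicates)            # 1
print(unique_rows[counts > 1])   # tensor([[1., 2.]])

# The sum-based check sees only 2 distinct values here, although there are 3 distinct rows
print(x.sum(dim=1).unique().size())  # torch.Size([2])
```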

nastaranmarzban · March 7, 2022, 9:00am

Yes, my dataset contains duplicates. This is because I simulate the data and don’t have a real dataset. I’ll have to find a way to avoid identical rows in my dataset. Thanks very much for your prompt response.
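
One way to sidestep the duplicate problem entirely might be to split row indices instead of the rows themselves, and build the boolean masks from those indices. The sketch below uses torch.randperm as a stand-in for the sklearn split and random data as a stand-in for x; the sizes 50, 33 and 17 are taken from the post above, everything else is an assumption:

```python
import torch

N = 50
x = torch.randn(N, 200)  # stand-in for the real data

# Shuffle the row indices reproducibly and split them 33 / 17
g = torch.Generator().manual_seed(0)
perm = torch.randperm(N, generator=g)
n_test = 17
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Boolean masks over the rows of x, built from indices rather than row contents
train_mask = torch.zeros(N, dtype=torch.bool)
train_mask[train_idx] = True
test_mask = torch.zeros(N, dtype=torch.bool)
test_mask[test_idx] = True

print(train_mask.sum().item(), test_mask.sum().item())  # 33 17
```

Because each index lands in exactly one split, the masks stay disjoint even when x contains identical rows.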