Ways to handle collisions when assigning multiple entries to the same location in a torch tensor

Hi,

I have a question about vectorized indexing when assigning entries from a source tensor to a target tensor and there are collisions in the target indexes.

Specifically, consider the following example:

import torch

device = torch.device("cpu")

source_tensor = torch.tensor(
    [1.23, 2.68, 9.37, 5.84, 6.87, 5.21, 7.46, 2.33, 8.95],
    dtype=torch.float32,
    device=device,
)

indexes = torch.tensor(
    [0, 0, 1, 0, 0, 1, 0, 1, 1],
    dtype=torch.long,
    device=device,
)

target_tensor = torch.zeros(2, dtype=torch.float32, device=device)

target_tensor[indexes] = source_tensor
print(target_tensor)

Here, multiple entries from the source tensor get assigned to the same locations (0 and 1) in the target tensor. In this example the number of points is still small, but in my practical case it can be 10K+.

  • The first issue I encountered is that I get different results if I change the device to torch.device("cuda"), which means collisions are handled differently on the GPU and the CPU. How do I make the behavior consistent?

  • Related to the issue above: in this example the CPU version seems to assign the entries sequentially (at least on my desktop), so one gets tensor([7.4600, 8.9500]) as the result (the last entries assigned to locations 0 and 1, respectively). But in my practical case (> 10K points), the assignment is not done sequentially even on the CPU. Is there a way to enforce sequential behavior on both GPU and CPU? Then we could pre-sort the tensor if we always want to fetch the minimum when collisions happen. One might suggest simply removing the collisions first, but I don't know how to do that with vectorized operations, and doing it without vectorization makes the process very slow.

Another potential solution is Tensor.index_reduce_(). With this function, we can specify reduction rules like amin, etc. However, it won't give me the inverse indices, i.e. I can't infer which entries from source_tensor were selected. Is there any suggestion on how to get the inverse indices?

Thanks

I don’t think there is a way to make the result consistent between GPU and CPU as both are using undefined behavior and the result depends on the order of operations.
The right approach would be to use e.g. index_reduce_ or to remove the duplicates as you’ve already mentioned.
Could you describe your last comment in more detail, in particular the concern regarding the “inverse indices” and how these should be created?

Hi @ptrblck ,

Thank you for the reply! The reason I need the inverse indices is that the source_tensor and target_tensor actually have multiple channels. For example, source_tensor can have shape (N, 10), where N is the number of points and 10 is the feature dimension. Within these 10 channels, we want to use one of them to resolve the collisions.

For instance, one channel can hold distances, and when a collision happens we always want to keep the entry with the smallest distance. With Tensor.index_reduce_(), we can pull out that specific channel via source_tensor[:, 0] (imagine it is the first channel) and perform the reduction, and the reduced values will be correct. However, we don't know which entry was selected from the original tensor, so we don't know how to assign the remaining 9 channels correctly.
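To make this concrete, here is a tiny sketch with hypothetical shapes showing how I would use the inverse indices if they were available:

import torch

N, M, C = 9, 2, 10                      # hypothetical sizes: points, target slots, channels
source_features = torch.randn(N, C)     # channel 0 would hold the distances
inverse_indices = torch.tensor([0, 7])  # what I wish index_reduce_ could also return

# With the inverse indices available, filling all channels is a single gather:
target_features = source_features[inverse_indices]
assert target_features.shape == (M, C)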

So in the example from my post, the inverse indices would be [0, 7] if the reduction rule is amin. Among all entries assigned to target index 0, the entry at index 0 of source_tensor has the smallest value; similarly, among all entries assigned to target index 1, the entry at index 7 has the smallest value.

Thanks for the explanation.
Indeed the indices won't be returned, and it seems you might need to use another operation to recover them, as shown here:

import torch

device = torch.device("cpu")

source_tensor = torch.tensor(
    [1.23, 2.68, 9.37, 5.84, 6.87, 5.21, 7.46, 2.33, 8.95],
    dtype=torch.float32,
    device=device,
)

indexes = torch.tensor(
    [0, 0, 1, 0, 0, 1, 0, 1, 1],
    dtype=torch.long,
    device=device,
)

target_tensor = torch.zeros(2, dtype=torch.float32, device=device)
target_tensor.index_reduce_(0, indexes, source_tensor, reduce="amin", include_self=False)
print((target_tensor.unsqueeze(1) == source_tensor.unsqueeze(0)).nonzero()[:, 1])
# tensor([0, 7])

I would also assume you are using a similar approach to get the indices, or how are you returning them right now?

Hi @ptrblck ,

Thank you for the prompt reply. Unfortunately, the source tensor and the target tensor in our case have more than 100K entries, so the suggested solution is not feasible: even though it's vectorized, the pairwise comparison makes the memory and time complexity O(N^2).

The current solution I'm using is a bit hacky. If one wants to keep the min or max based on a specific channel, it can be achieved by hiding the indices in the lower digits and shifting the value parts to the higher digits.

Of course, the amount of shifting depends on the precision you want to keep and the number of points you have. After the reduction, just strip the value parts and you get the inverse indices.
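Roughly, a minimal sketch of what I mean on the toy data from above (the scale of 1000 is an arbitrary choice for this illustration and determines how many decimal digits of the value survive the quantization):

import torch

device = torch.device("cpu")

source_tensor = torch.tensor(
    [1.23, 2.68, 9.37, 5.84, 6.87, 5.21, 7.46, 2.33, 8.95],
    dtype=torch.float32,
    device=device,
)
indexes = torch.tensor([0, 0, 1, 0, 0, 1, 0, 1, 1], dtype=torch.long, device=device)

n = source_tensor.numel()
scale = 1000  # keep ~3 decimal digits of the value; chosen arbitrarily for this sketch

# Pack: quantized value in the high digits, source index in the low digits.
# float64 represents integers exactly up to 2**53, so the packing stays exact
# as long as round(value * scale) * n + index < 2**53.
packed = (source_tensor.double() * scale).round() * n + torch.arange(n, device=device)

packed_min = torch.zeros(2, dtype=torch.float64, device=device)
packed_min.index_reduce_(0, indexes, packed, reduce="amin", include_self=False)

# Unpack: the low digits give the inverse indices, the high digits the reduced values.
inverse_indices = (packed_min % n).long()
min_values = (packed_min // n) / scale
print(inverse_indices)  # tensor([0, 7])
print(min_values)       # the winning channel values, approximately [1.23, 2.33]

The remaining channels can then be gathered with the inverse indices as in my earlier sketch.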

It's not a great solution though; it would be nice if inverse indices were supported in the future.

Yes, the memory requirement could be large due to the broadcasting depending on the input tensor shapes.

Could you describe your “hacky” approach a bit more and maybe post a code snippet, please?