Behavior of torch.unique

I expected either torch.unique or torch.unique_consecutive to give a tensor of unique elements; instead, I get what looks like a random shuffle of the elements.

Here is what I did; please let me know if this is the correct way of invoking things.

a = torch.tensor([[3,1,3,3,4], [2,1,4,3,1]], dtype=torch.long)
output = torch.unique(a, sorted=True, dim=1)
output = torch.unique_consecutive(output, dim=1)
a.shape, output.shape, output
(torch.Size([2, 5]),
 torch.Size([2, 5]),
 tensor([[1, 3, 3, 3, 4],
         [1, 2, 3, 4, 1]]))

OK, fine, maybe chaining is bad (but it shouldn't be, really: unique_consecutive expects duplicate elements to be consecutive, which I believe is achieved by sorted=True; presumably the sorting is done along the same dim). So this is a mystery, but OK.

With the same a = torch.tensor([[3,1,3,3,4], [2,1,4,3,1]], dtype=torch.long)

output = torch.unique(a, dim=1) yields

tensor([[1, 3, 3, 3, 4],
        [1, 2, 3, 4, 1]])

and output = torch.unique(a, dim=0) yields

tensor([[2, 1, 4, 3, 1],
        [3, 1, 3, 3, 4]])

Clearly I am misunderstanding the usage of torch.unique.

I expected to achieve something like [list(set(x.tolist())) for x in a], which clearly cannot be a tensor, because not all rows would have the same number of unique elements. So if the return type is a tensor, somewhere something must be casting/padding additional elements to satisfy the tensor return type, which of course defeats the purpose, as the elements are now repeated.

In the simple example for a, such a unique operation yields:

z = [list(set(x.tolist())) for x in a]
z
[[1, 3, 4], [1, 2, 3, 4]]

Please advise.
For now I have this workaround (I need the unique values in their order of appearance; I could use an OrderedDict, but that takes more space):

def unique(tensor):
    '''
    Returns a tensor with the unique values of the given 1-D tensor,
    preserving their order of appearance.
    '''
    seen = set()
    unique_values = []
    for val in tensor.tolist():  # copies the data to a Python list on the host
        if val not in seen:
            seen.add(val)
            unique_values.append(val)
    # new_tensor keeps the dtype and device of the input tensor
    return tensor.new_tensor(unique_values)
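
For a 1-D tensor this does what I want; for example:

unique(torch.tensor([3, 1, 3, 3, 4]))
yields tensor([3, 1, 4]).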

The problem with this function is that it expects the tensor to be 1-D, so for a multi-dimensional tensor I have to apply it in a loop. All of these are CPU operations, and they inherently require copying big tensors to host memory because of tolist(). I was hoping to rely on a CUDA-based implementation, but because of the sort I cannot. Still, it would be nice to at least have a torch.unique() that does not return repeated values.

It would be a great help if someone could update the docs with an example that hits this problem (perhaps using the same a?).

If I am wrong in my understanding of how to use these two functions, please let me know; that would be a tremendous help, and in that case a more complicated example in the docs would further help others.

Many thanks!

torch.unique with a dim argument will return the unique subtensors (slices) along the specified dimension, while dim=None will treat the tensor as a flattened tensor.
Have a look at this example:

x = torch.tensor([[0, 0, 0],
                  [0, 0, 1],
                  [0, 0, 0],
                  [0, 0, 1]])

print(torch.unique(x, dim=0))
> tensor([[0, 0, 0],
        [0, 0, 1]])
print(torch.unique(x, dim=1))
> tensor([[0, 0],
        [0, 1],
        [0, 0],
        [0, 1]])
print(torch.unique(x))
> tensor([0, 1])

If you want to get the flattened unique values for each row, you would need to use a loop, as torch.unique won’t add padding to the result.
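
For example, something like this (just a Python loop; it returns a list of 1-D tensors of possibly different lengths rather than a single tensor):

print([torch.unique(row) for row in x])
> [tensor([0]), tensor([0, 1]), tensor([0]), tensor([0, 1])]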


Thank you very much! So I was correct in understanding that there is padding.

While this padding is required for building a tensor, it is quite misleading to see repeats. Updating the example in the docs with your example above would be really, really helpful.

For my use case I will stick to my implementation then, as there is no way around a loop if I want to avoid the padded duplicates. This is mostly because I need the unique values in order of appearance. However, an alternative would be to get the relative indices and then reorder.
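
Something along these lines might also work as a sketch of that alternative (assuming a recent PyTorch where Tensor.scatter_reduce is available; I have not benchmarked it):

import torch

def ordered_unique(t):
    # order-preserving unique for a 1-D tensor, staying on t's device
    u, inv = torch.unique(t, sorted=True, return_inverse=True)
    pos = torch.arange(inv.numel(), device=t.device)
    # smallest position at which each unique value first appears
    first = torch.zeros(u.numel(), dtype=torch.long, device=t.device)
    first = first.scatter_reduce(0, inv, pos, reduce="amin", include_self=False)
    # reorder the sorted unique values by their first appearance
    return u[first.argsort()]

ordered_unique(torch.tensor([3, 1, 3, 3, 4]))
yields tensor([3, 1, 4]).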

I don’t see where the padding is.
My code snippet returns the unique rows, then the unique columns, and finally the unique scalars.
Where are these tensors potentially padded?

You are right; I read your first comment more carefully after reading this line in your second:

My code snippet returns the unique rows, then the unique columns, and finally the unique scalars.

Essentially, using dim I can select unique rows, or unique slices along a dimension as a whole. This has nothing to do with padding! So the usage gives "unique rows" and not "here are the unique elements of each row". (In my original a, all five columns are distinct as pairs, which is why unique(a, dim=1) returned all five of them, just lexicographically sorted.)

What I should be looking at are the scalars themselves within each slice, which is to say: what are the unique scalars in each entry along the specified dimension.
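
So for my original a, what I was actually after is just a loop over the rows:

[torch.unique(row) for row in a]
yields [tensor([1, 3, 4]), tensor([1, 2, 3, 4])]

(sorted, rather than in order of appearance, hence my workaround above).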

Thanks, I now understand things better.