Let’s say I have two tensors:

```
import torch

samples = torch.tensor([50, 60, 40, 50, 50, 60])
labels = torch.tensor([0, 0, 1, 2, 1, 1])
```

What I would like to do is to efficiently retrieve the values in **labels** for each unique value in **samples**. So the expected output would be these three tensors, one per unique sample in sorted order (as returned by **torch.unique**):

```
torch.tensor([1])
torch.tensor([0, 2, 1])
torch.tensor([0, 1])
```

Now, of course, I could call **torch.unique(samples)** to get the unique samples and use each of them in a loop to index **labels**; for example, **labels[samples == 50]** yields the expected **torch.tensor([0, 2, 1])** (see the sketch below). However, I am not sure that this is efficient, since I am still using a Python for loop. Should I do something smart with the **return_inverse** parameter of **torch.unique**? I haven't solved the puzzle yet!
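
For reference, this is roughly what that loop would look like (just a sketch of the approach described above; the name **grouped** is my own):

```
# Loop-based approach: index labels once per unique sample.
uniques = torch.unique(samples)
grouped = [labels[samples == u] for u in uniques]
# grouped == [tensor([1]), tensor([0, 2, 1]), tensor([0, 1])]

# And for reference, return_inverse maps every element of samples
# to the index of its unique value:
# torch.unique(samples, return_inverse=True)[1] == tensor([1, 2, 0, 1, 1, 2])
```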

An approach that is probably even less efficient would be something like this (this is what I currently do):

```
from collections import defaultdict

occurrences = defaultdict(list)
for sample, label in zip(samples, labels):
    # Flatten the (possibly multi-dimensional) sample into a hashable key.
    key = sample.flatten().tolist()
    occurrences[tuple(key)].append(label)
```
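
With the example tensors above, this yields **{(50,): [tensor(0), tensor(2), tensor(1)], (60,): [tensor(0), tensor(1)], (40,): [tensor(1)]}**; the tuple keys keep multi-dimensional samples hashable.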

In my specific problem, **samples** is either a Python list of 2D tensors or a large 3D tensor (based on whether the tensors could be stacked, since the last dimension may vary depending on the configuration). The reason I want to collect the occurrences in **labels** for every unique sample is that I want to calculate the entropy of those labels. So, currently, after the above code, I have something like this:

```
from scipy.stats import entropy

for sample, label_occurrences in occurrences.items():
    label_occurrences = torch.tensor(label_occurrences)
    # Count how often each label occurs for this sample.
    _, counts = torch.unique(label_occurrences, return_counts=True)
    # Normalised entropy of this sample's label distribution
    # (num_classes is defined elsewhere in my code).
    etr = entropy(counts / label_occurrences.shape[0], base=num_classes)
```

This also feels very inefficient. **How can I make my code more efficient(/vectorised)?**

(Bonus: I am using SciPy's entropy function in the last snippet; are there perhaps even faster ways to calculate these numbers using Torch's built-in operations?)
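
For completeness, this is how I imagine the entropy call inside the loop could be replaced with Torch-only operations (a sketch, not benchmarked; it reuses **counts**, **label_occurrences**, and **num_classes** from the snippet above):

```
import math

# scipy.stats.entropy(pk, base=b) == -sum(p * log p) / log(b) for normalised p.
# counts from torch.unique are all nonzero, so log never sees a zero here.
probs = counts.float() / label_occurrences.shape[0]
etr = -(probs * probs.log()).sum() / math.log(num_classes)
```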