Hi all,
I am looking for an idiomatic way to build batched multi-hot vectors (multi-hot being like one-hot, except that several entries can be hot at once). For example, we might have
[[1. 0. 1. 0. 0. 0.]
[0. 1. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 1.]]
as a sequence of three 2-hot vectors generated from the sequence [a b c], where a maps to classes 0 and 2, b maps to classes 1 and 4, and c maps to classes 1 and 5.
Suppose I have a function that maps each element of a sequence to a tuple of its class indices, and I have a batch of sequences that all have the same length. Is there an idiomatic way to transform the sequences into their batched multi-hot representations?
More specifically, I have a tensor of size (Batch Size x Sequence Length) and I want to build one of size (Batch Size x Sequence Length x Classes), where the last (feature) dimension can be multi-hot based on a mapping from sequence elements to tuples of classes.
There is a one-hot version like:
import numpy as np

def one_hot_encode(arr, n_labels):
    # Initialize the encoded array: one row per element of arr
    # (arr.size works for any number of dimensions)
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    # Finally reshape it to the original shape plus a trailing class axis
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot
and I have had luck summing two batched one-hot encoded arrays together, but this strikes me as inelegant.
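For concreteness, the summing workaround looks roughly like the sketch below. The lookup table class_pairs, the helper name multi_hot_by_summing, and the sample seqs batch are made up for illustration (they encode the a, b, c example from above), and the sketch assumes every element maps to exactly two distinct classes.

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Same idea as the one-hot helper in this post, written with arr.size
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(arr.size), arr.ravel()] = 1.0
    return one_hot.reshape(*arr.shape, n_labels)

# Hypothetical lookup table: row i holds the two class indices of element i.
# Rows correspond to a -> (0, 2), b -> (1, 4), c -> (1, 5).
class_pairs = np.array([[0, 2], [1, 4], [1, 5]])

def multi_hot_by_summing(seqs, class_pairs, n_labels):
    # seqs: integer array of shape (batch, seq_len) holding element ids.
    # One-hot encode the first and second class of every element, then add.
    first = one_hot_encode(class_pairs[seqs, 0], n_labels)
    second = one_hot_encode(class_pairs[seqs, 1], n_labels)
    return first + second

# Two sequences of length three: [a b c] and [c b a]
seqs = np.array([[0, 1, 2], [2, 1, 0]])
multi_hot = multi_hot_by_summing(seqs, class_pairs, n_labels=6)
# multi_hot has shape (batch, seq_len, n_labels) = (2, 3, 6)
```

One caveat with this approach: if an element ever maps to the same class twice, the sum puts a 2 in that position rather than a 1.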
Some notes:
- This is for input, not for a multi-label classification output.
- I realize it would be possible to just do a one-hot encoding based on the Cartesian product of class pairs, but in my actual use case the number of classes and possible classes per element is such that combinatorial explosion makes one-hot encoding them infeasible.
Thanks in advance.