What is the best way to predict a categorical variable, and then embed it, as input to another net?
My instances are tabular, a mix of categorical and continuous variables. I currently have a siamese net (net1) that takes these instances as input. The categorical variables are integer indices fed into an nn.Embedding layer. Once net1 is trained, it is fixed.
Now, I want net0 to predict (output) these tabular instances and feed them as input to the above net, and backprop the loss through the entire network (net0 => net1), even though net1 will be fixed and not modified.
- What seems sensible to me is that net0 predicts the one-hot (softmax) representation of the category, and I then figure out how to change net1 so that nn.Embedding takes a one-hot vector, not an integer index, as input. However, I don’t believe nn.Embedding can accept a one-hot input (How to make nn.Embedding support one-hot encoded vector input?)
- net0 predicts the one-hot (softmax) representation of the category, takes the argmax to get the integer index, and net1 looks up the nn.Embedding using that index. However, argmax is non-differentiable, which means I cannot backpropagate the loss from net1 to net0 (Differentiable argmax)
- I could have net0 directly predict the embedding of net1, and in postprocessing use kNN to find the nearest category. I don’t love this because a) it requires the net1 embeddings to be fixed and b) it seems a little inelegant.
What is my best option here to predict a categorical variable and then embed it?
One approach that is a little cheesy, and perhaps a little slower, but will be very straightforward to implement and understand is:
- Convert the categorical variable to a one-hot encoding. Instead of an nn.Embedding layer, have a 1xE (E=embedding size) weight matrix for each dimension in the one-hot. Sum all the 1xE embeddings for the category and maybe squash them?
In this case, if it is truly one-hot, then only one embedding weight matrix is active. If the input is a probability distribution over different categories, their embeddings are mixed together.
Yes. In other words, you take the probability vector over the categories and feed it into a Linear layer that has the embedding as its weight.
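A minimal sketch of that idea (the sizes here are assumptions for illustration): a bias-free Linear layer reproduces an nn.Embedding lookup exactly when fed a one-hot vector, and accepts a softmax distribution with gradients flowing through it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: one categorical variable with 10 categories,
# embedded into 8 dimensions.
num_categories, embedding_dim = 10, 8

# A Linear layer without bias is an embedding lookup when fed a
# one-hot vector: one_hot @ weight.t() selects one row of the table.
soft_embedding = nn.Linear(num_categories, embedding_dim, bias=False)

# To reuse a trained nn.Embedding's table, copy its weight (transposed)
# into the Linear layer.
trained = nn.Embedding(num_categories, embedding_dim)
with torch.no_grad():
    soft_embedding.weight.copy_(trained.weight.t())

# Hard one-hot input: equivalent to trained(idx).
idx = torch.tensor([3])
one_hot = nn.functional.one_hot(idx, num_categories).float()
emb_hard = soft_embedding(one_hot)

# Soft input: a softmax from net0 mixes category embeddings, and
# gradients flow back to net0's logits.
logits = torch.randn(1, num_categories, requires_grad=True)
emb_soft = soft_embedding(logits.softmax(dim=-1))
emb_soft.sum().backward()  # logits.grad is now populated
```

With net1's embedding weight frozen (requires_grad=False), the loss still backpropagates through the matmul into net0's softmax.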
@tom what is the best way to code this, if I have many categorical variables?
If I split the categoricals into an array of one-hot tensors, then it is a little easier to construct the net. However, I am concerned this would not be performant.
If I have all the values together, then I might need to have the dataset provide a list of ranges to slice. It’s a little trickier, but is that much faster?
Also, is it correct that this will be faster if the categorical one-hot inputs are sparse tensors? I have 10-50 categorical variables, each with about 10 categories, and about 50 more continuous variables. I suspect that with such a low number of categories, dense is faster.
So if I understand this correctly, you have several variables and each of these has a moderate number of categories.
- If they’re the same categories (i.e. you have the same embedding), it is more efficient to batch them.
- If you have distinct embeddings, combining them would create a block-diagonal matrix with many zeros. That usually isn’t as efficient to process unless you use block-sparse variables, so I’d avoid that.
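For the first case, a sketch of what batching shared-vocabulary variables might look like (the sizes are assumptions): all variables index the same nn.Embedding, so one lookup covers every variable at once.

```python
import torch
import torch.nn as nn

# Hypothetical: 4 categorical variables that all share the same
# 10-category vocabulary, batch of 32 instances.
num_vars, num_categories, embedding_dim, batch = 4, 10, 8, 32
shared_emb = nn.Embedding(num_categories, embedding_dim)

# Integer indices for all variables at once: (batch, num_vars).
idx = torch.randint(num_categories, (batch, num_vars))

# One lookup handles every variable, giving
# (batch, num_vars, embedding_dim); flatten for the rest of the net.
out = shared_emb(idx).reshape(batch, num_vars * embedding_dim)
```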
If, on the other hand, you have so many categories that using a full linear layer is impractical, it is common to do negative sampling: if your first net's output is a vector probs, you can grab the K_pos largest indices (i_pos = probs.topk(K_pos, dim=-1).indices or somesuch) and K_neg random indices (i_neg = torch.randint(num_categories, (batch, K_neg)); i_all = torch.cat((i_pos, i_neg), dim=-1)). Then you take the sampled probabilities probs.gather(-1, i_all) weighting the sampled rows emb.weight[i_all] (plus the bias, if your embedding is a Linear layer with one) as the embeddings.
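A runnable sketch of that sampling trick, under assumed sizes; the per-instance gather/weighting below is one way to realize the idea, approximating the full probs @ weight product using only the sampled rows:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a large vocabulary where the full dense
# probs @ embedding product would be expensive.
batch, vocab, emb_dim = 16, 10_000, 32
k_pos, k_neg = 8, 8

emb = nn.Embedding(vocab, emb_dim)                 # the embedding table
probs = torch.randn(batch, vocab).softmax(dim=-1)  # first net's output

# K_pos most likely categories per instance ...
i_pos = probs.topk(k_pos, dim=-1).indices
# ... plus K_neg random "negative" categories.
i_neg = torch.randint(vocab, (batch, k_neg))
i_all = torch.cat((i_pos, i_neg), dim=-1)          # (batch, K)

# Gather matching probabilities and embedding rows, then mix them:
p_sel = probs.gather(-1, i_all)                    # (batch, K)
e_sel = emb.weight[i_all]                          # (batch, K, emb_dim)
mixed = torch.einsum('bk,bke->be', p_sel, e_sel)   # (batch, emb_dim)
```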
This is similar to negative sampling in word2vec (I recommend Richard Socher’s/Chris Manning’s lectures on it if you want a very detailed account), except that there the vectors are leaves and the network following them is shallow.
@tom Thank you for the detailed feedback. I am aware of the negative sampling technique; it was originally used by Collobert + Weston (2008) in their work, the first fast neural embedding method. (I re-implemented their embedding method and published work on , I changed my username in case you are curious about my work on word embeddings.)
I am not aware of how this would work in the context of a siamese network. Regardless, I have few categories per variable (roughly 5-30, median 10) per embedding, and about 10 categorical variables. So this is not my issue.
You are right that this is slow.
I did the most naive thing, which was to decompose the categoricals into dense vectors, each with its own embedding layer of dimension 8.
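The code wasn't included in the post; a minimal sketch of that naive per-variable setup (the category counts here are assumptions) might look like:

```python
import torch
import torch.nn as nn

# Hypothetical setup matching the description: one nn.Embedding of
# dimension 8 per categorical variable, applied column by column.
cardinalities = [10, 12, 5]   # assumed category counts per variable
emb_dim = 8

embeddings = nn.ModuleList(
    nn.Embedding(n, emb_dim) for n in cardinalities
)

batch = 4
# Integer indices, one column per categorical variable. Note: the
# indices must be an integer dtype; stray float values in the raw
# categories can silently inflate the apparent number of classes.
cats = torch.stack(
    [torch.randint(n, (batch,)) for n in cardinalities], dim=1
)

# Embed each column separately and concatenate:
# (batch, len(cardinalities) * emb_dim).
out = torch.cat(
    [emb(cats[:, i]) for i, emb in enumerate(embeddings)], dim=1
)
```

The Python-level loop over embedding layers is the likely bottleneck: it launches one small lookup per variable instead of one batched operation.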
This was about 10x slower with 1000 examples, as I would expect. But with 200K examples, it seems far slower, by orders of magnitude, which I did not expect at all. Do you have any idea why? I am currently training on CPU while prototyping, FYI.
[edit: there were some floats in my input categories which exploded the number of classes]
I do understand that rewriting this network to take sparse inputs for the categories would be faster. I want to avoid this because:
If I switch to sparse inputs, then later, to pass the output of a softmax into this siamese network for prediction (or, moreover, to train a joint network with the softmax feeding the siamese net), I will need dense inputs.
p.s. greetings from Berlin
Oh, right. You’re in the first case, so negative sampling is not applicable, and you already know all about it.
Would a large batched matmul help? You would have the difficulty of padding the shorter vectors, but I think it might work.
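One possible reading of the batched-matmul idea, with assumed sizes: pad every variable's one-hot (or softmax) to the largest category count, stack the per-variable weight matrices, and replace the per-variable loop with a single torch.bmm.

```python
import torch
import torch.nn as nn

# Hypothetical: 3 variables with different category counts, padded to
# the largest count so one batched matmul handles all of them.
cardinalities = [5, 10, 7]
max_c, emb_dim, batch = max(cardinalities), 8, 4

# One (max_c, emb_dim) weight per variable; rows beyond a variable's
# real category count are never selected, so the padding is inert.
weights = nn.Parameter(torch.randn(len(cardinalities), max_c, emb_dim))

# Padded one-hot (or softmax) inputs: (batch, num_vars, max_c).
probs = torch.zeros(batch, len(cardinalities), max_c)
for v, n in enumerate(cardinalities):
    idx = torch.randint(n, (batch,))
    probs[torch.arange(batch), v, idx] = 1.0

# Single batched matmul over variables:
# (num_vars, batch, max_c) @ (num_vars, max_c, emb_dim)
out = torch.bmm(probs.transpose(0, 1), weights).transpose(0, 1)
# out: (batch, num_vars, emb_dim)
```

The padded columns cost a few extra multiplications by zero, but that is usually cheaper than launching one tiny matmul per variable.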
Are you suggesting that I transform all the categoricals into a one-hot that is of fixed length?
Thank you for the advice, by the way. It has been really helpful.