What is the best way to predict a categorical variable, and then embed it, as input to another net?

My instances are tabular, a mix of categorical and continuous variables. I currently have a siamese net (net1) that uses the instances as input. The categorical variables are integer indices used before an nn.Embedding layer. Once net1 is trained, it is fixed.

Now, I want net0 to predict (output) these tabular instances and feed them as input to the above net, and backprop the loss through the entire network (net0 => net1), even tho net1 will be fixed and not modified.

What seems sensible to me is net0 predicts the one-hot (softmax) representation of the category, and then figure out to change net1 so that nn.Embedding uses a one-hot, not integer index, as the input. However, I donâ€™t believe nn.Embedding can accept a one-hot input (How to make nn.Embedding support one-hot encoded vector input?)

net0 predicts the one-hot (softmax) representation of the category, takes the argmax to get the integer index, and net1 looks up the nn.Embedding using the index. However, I believe that argmax is non-differentiable, which means I cannot pass losses back from net1 to net0 Differentiable argmax

I could have net0 directly predict the embedding of net1, and in postprogressing use kNN to find the nearest category. I donâ€™t love this because a) it requires the net1 embeddings to be fixed and b) it seems a little inelegant.

What is my best option here to predict a categorical variable and then embed it?

One approach that is a little cheesy, and perhaps a little slower, but will be very straightforward to implement and understand is:

Convert the categorical variable to a one-hot encoding. Instead of an nn.Embedding layer, have a 1xE (E=embedding size) weight matrix for each dimension in the one-hot. Sum all the 1xE embeddings for the category and maybe squash them?

In this case, if it is truly one-hot, then only one embedding weight matrix is active. If the input is a probability distribution over different categories, their embeddings are mixed together.

@tom what is the best way to code this, if I have many categorical variables?

If I split the categoricals into an array of one-hot tensors, then it is a little easier to construct the net. However, I am concerned this would not be performant.

If I have all the values together, then I might need to have the dataset provide a list of ranges to slice. Itâ€™s a little trickier, but is that much faster?

Also, is it correct that this will be faster if the categorical one-hot inputs are sparse tensors? I have 10-50 categorical variables, each has about 10 categories, and about 50 more continuous variables. I suspect that with such a low number of categories, that dense is faster.

So if I understand this correctly, you have several variables and each of these has a moderate number of categories.

If theyâ€™re the same categories (i.e. you have the same embedding), it is more efficient to batch them.

If you have distinct embeddings, combining them would create a block-diagonal matrix with many zeros. That usually isnâ€™t as efficient to process unless you use block-sparse variables, so Iâ€™d avoid that.

If, on the other hand, you have so many categories that using a linear layer is impractical, it is common to do negative sampling: If you first nnâ€™s output is a vector probs, you can grab the K_pos largest indices (i_pos = topk(probs, K_pos, dim=-1).indices or somesuch) and K_neg random indices (i_neg = randint((batch, K_neg)); i_all = torch.cat((i_pos, i_neg), dim=-1)). Then you take probs[:, i_all] @ emb.weights.t()[i_all] + emb.bias() as embeddings.
This is not similar to negative sampling in word2vec (I recommend Richard Socherâ€™s/Chris Manningâ€™s lectures on them if you want a very detailed account), except that there the vectors are leaves and the network following them is shallow.

@tom Thank you for the detailed feedback. I am aware of the negative sampling technique, it was originally used by Collobert + Weston (2008) in their work that was the first fast neural embedding method. (I re-implemented their embedding method and published work on , I changed my username in case you are curious about my work on word embeddings.)

I am not aware of how this would work in the context of a siamese network. Regardless, I have few categories (roughly 5-30 per category, median 10) per embedding. I have about 10 categorical variables. So this is not my issue.

You are right that this is slow.

I did the most naive thing which was decompose the categoricals into dense vectors, each with their own embedding layer with dimension 8:

This was about 10x slower with 1000 examples, as I would expect. But with 200K examples, it seems like waaaaay slower, like orders of magnitude, which I did not expect at all. Do you have any idea why? I am currently training on CPU while prototyping, FYI.

[edit: there were some floats in my input categories which exploded the number of classes]

I do understand that rewriting this network to take sparse inputs for the categories would be faster. I want to avoid this because:
If I switch to sparse inputs, then later to pass the output of a softmax into this siamese network for prediction, or moreover to train a joint network with softmax feeding the siamese, I need dense inputs.

Oh, right. Youâ€™re in the first case, so negative sampling is not applicable, and you already know all about it.
Would a large batch matmul help? You would have the difficulty amending the shorter vectors, but I think it might work.

Are you suggesting that I transform all the categoricals into a one-hot that is of fixed length?
Thank you for the advice by the way. It has been really helpful