I am using embedding tables for a bunch of categorical features. Some of those categorical features are actually lists of categories.
Most of these values are hashed strings, so they span pretty much the entire int64 range.
So far, I’ve been using a modulo to cut the number of unique values down to a few hundred thousand, but that comes with a lot of problems: collisions, no way to map unknown values to a dedicated OOV placeholder, wasted memory, etc.
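Concretely, the current setup looks roughly like this (the bucket count and ids below are made up for illustration):

```python
import torch

NUM_BUCKETS = 200_000                      # made-up bucket count
emb = torch.nn.Embedding(NUM_BUCKETS, 16)

# Hashed int64 ids as they arrive (values invented for the example)
raw = torch.tensor([8237462387462, -994857362514, 8237462387462])

idx = raw % NUM_BUCKETS   # collisions are unavoidable, and there is no way
out = emb(idx)            # to reserve a dedicated OOV slot for unseen values
```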
I would like to implement a simple per-feature mapping where each value gets mapped to a dense index in the embedding table (this way rare and unseen values can easily be mapped to 0), but I haven’t managed to do this successfully in PyTorch yet.
There is no dedicated “map” function, so it seems this can only be done with .apply_ or .item(), which would move the computation to the CPU and probably slow it down unreasonably.
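For example, here is a toy version of the mapping I’m after (the vocab and ids are invented), using the .apply_ route mentioned above:

```python
import torch

vocab = {111111: 1, 222222: 2, 333333: 3}      # built offline; 0 reserved for OOV
raw = torch.tensor([111111, 999999, 222222])   # hashed int64 values, as received

# The only route I've found goes through Python: .apply_ runs the callable
# element by element and only works on CPU tensors.
mapped = raw.clone().apply_(lambda v: vocab.get(v, 0))
# -> tensor([1, 0, 2])

emb = torch.nn.Embedding(num_embeddings=len(vocab) + 1, embedding_dim=16)
out = emb(mapped)
```

This gives exactly the behaviour I want, but it is pure Python running per element, which is what I’m trying to avoid.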
I’ve searched all over GitHub, Stack Overflow and the rest of the internet, and the proposed solution always seems to be “just map / tokenize your data before passing it to the model”. Unfortunately, in our case this is not possible: we are required to deliver a single ONNX model object that can receive the features “as-is”. I’ve also tried adding these mappings to the ONNX graph separately, either by inserting the mapping nodes manually or by exporting sklearn’s DictVectorizer / LabelEncoder on their own and connecting their outputs to the PyTorch model’s inputs, but I haven’t been able to get that working either.
Am I missing something really obvious here? This is something I could do rather easily in TensorFlow (to the point where my inputs didn’t need to be ints at all and I could just pass strings in directly), so I don’t believe it can be that complicated in torch. On the other hand, torch tokenizers always seem to come separately from the models, so maybe there is a reason for that (I understand why you might want to keep the tokenizer separate from the architecture, but I don’t understand why you couldn’t package the two together to encapsulate the end-to-end behaviour of the model).
Should I not be that worried about running this part on the CPU? I am using the eager API, so I am guessing it would block the rest of the computation from taking place. Would I have to switch to compiled graphs to avoid that?
I just looked up sparse arrays and they seem like they could help, but nobody talks about using them this way, so again it feels like I might be missing something. Could something like this be implemented with sparse tensors? Say my mapping is {111111: 2, 222222: 3}; if I had a sparse tensor along the lines of [0…2…3…0], I imagine I could do something like sparse_tensor[my_feature] and get [2, 3, 2] back for [111111, 222222, 111111]. I think that may even be how I implemented this in TensorFlow in the first place, but that was almost 10 years ago, so I assumed it would be a legacy solution by now.
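To make that concrete, here is the shape of what I mean with a dense lookup tensor (toy keys; obviously a dense table can’t cover the real int64 range, hence the question about sparse tensors):

```python
import torch

lookup = torch.zeros(300_000, dtype=torch.long)   # 0 everywhere = OOV by default
lookup[111111] = 2
lookup[222222] = 3

my_feature = torch.tensor([111111, 222222, 111111])
print(lookup[my_feature])   # tensor([2, 3, 2])

# What I'd like is the same gather semantics without materialising the full
# key range, e.g. something along these lines (not sure anything like this exists):
# sparse_lookup = torch.sparse_coo_tensor([[111111, 222222]], [2, 3], (2**40,))
# sparse_lookup[my_feature]
```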
I did see this topic but it hasn’t really been resolved: Map the value in a tensor using dictionary