I am using embedding tables for a bunch of categorical features. Some of those categorical features are actually lists of categories.
Most of these values are hashed strings, so they span pretty much the entire int64 range.
So far, I’ve been using a modulo to cut the number of unique values down to a few hundred thousand, but that comes with a lot of problems: collisions, no way to map unknown values to a dedicated OOV placeholder, wasted memory, etc.
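Concretely, the current setup looks roughly like this (the bucket count and ids below are made up for illustration):

```python
import torch

NUM_BUCKETS = 200_000                      # made-up bucket count
emb = torch.nn.Embedding(NUM_BUCKETS, 16)

# Hashed int64 ids as they arrive (values invented for the example)
raw = torch.tensor([8237462387462, -994857362514, 8237462387462])

idx = raw % NUM_BUCKETS   # collisions are unavoidable, and there is no way
out = emb(idx)            # to reserve a dedicated OOV slot for unseen values
```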
I would like to implement a simple per-feature mapping where each value gets mapped to a dense index in the embedding table (this way rare and unseen values can easily be mapped to 0), but I haven’t managed to do this successfully in PyTorch yet.
There is no dedicated “map” function, so it seems this can only be done with .apply_ or .item(), which would move the computation to the CPU and probably slow it down unreasonably.
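For example, here is a toy version of the mapping I’m after (the vocab and ids are invented), using the .apply_ route mentioned above:

```python
import torch

vocab = {111111: 1, 222222: 2, 333333: 3}      # built offline; 0 reserved for OOV
raw = torch.tensor([111111, 999999, 222222])   # hashed int64 values, as received

# The only route I've found goes through Python: .apply_ runs the callable
# element by element and only works on CPU tensors.
mapped = raw.clone().apply_(lambda v: vocab.get(v, 0))
# -> tensor([1, 0, 2])

emb = torch.nn.Embedding(num_embeddings=len(vocab) + 1, embedding_dim=16)
out = emb(mapped)
```

This gives exactly the behaviour I want, but it is pure Python running per element, which is what I’m trying to avoid.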
I’ve searched all over GitHub, Stack Overflow and the rest of the internet, and the proposed solution always seems to be “just map / tokenize your data before passing it to the model”. Unfortunately, in our case this is not possible: we are required to deliver a single ONNX model object that can receive the features “as-is”. I’ve also tried adding these mappings to the ONNX graph separately, either by inserting the mapping nodes manually or by exporting sklearn’s DictVectorizer / LabelEncoder on their own and connecting their outputs to the PyTorch model’s inputs, but I haven’t been able to get that working either.
Am I missing something really obvious here? This is something I could do rather easily in TensorFlow (to the point where my inputs didn’t need to be ints at all and I could just pass strings in directly), so I don’t believe it can be that complicated in torch. On the other hand, torch tokenizers always seem to come separately from the models, so maybe there is a reason for that (I understand why you might want to keep the tokenizer separate from the architecture, but I don’t understand why you couldn’t package the two together to encapsulate the end-to-end behaviour of the model).
Should I not be that worried about running this part on the CPU? I am using the eager API, so I am guessing it would block the rest of the computation from taking place. Would I have to switch to compiled graphs to avoid that?
I just looked up sparse arrays and they seem like they could help, but nobody talks about using them this way, so again it feels like I might be missing something. Could something like this be implemented with sparse tensors? Say my mapping is {111111: 2, 222222: 3}; if I had a sparse tensor along the lines of [0…2…3…0], I imagine I could do something like sparse_tensor[my_feature] and get [2, 3, 2] back for [111111, 222222, 111111]. I think that may even be how I implemented this in TensorFlow in the first place, but that was almost 10 years ago, so I assumed it would be a legacy solution by now.
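To make that concrete, here is the shape of what I mean with a dense lookup tensor (toy keys; obviously a dense table can’t cover the real int64 range, hence the question about sparse tensors):

```python
import torch

lookup = torch.zeros(300_000, dtype=torch.long)   # 0 everywhere = OOV by default
lookup[111111] = 2
lookup[222222] = 3

my_feature = torch.tensor([111111, 222222, 111111])
print(lookup[my_feature])   # tensor([2, 3, 2])

# What I'd like is the same gather semantics without materialising the full
# key range, e.g. something along these lines (not sure anything like this exists):
# sparse_lookup = torch.sparse_coo_tensor([[111111, 222222]], [2, 3], (2**40,))
# sparse_lookup[my_feature]
```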
I did see this topic but it hasn’t really been resolved: Map the value in a tensor using dictionary