Faster way to do multiple embeddings?

AlliedToasters · June 24, 2019, 2:10am

Hello everybody!

I’m working on a torch-based library for building autoencoders with tabular datasets.

One big feature is learning embeddings for categorical features.

In practice, however, training many embedding layers simultaneously is costing me some slowdowns. I am using for-loops to do this and running the for-loop on each iteration is (I think) what’s causing the slowdowns.

When building the model, I associate embedding layers with each categorical feature in the user’s dataset:

        for ft in self.categorical_fts:
            feature = self.categorical_fts[ft]
            n_cats = len(feature['cats']) + 1
            embed_dim = compute_embedding_size(n_cats)
            embed_layer = torch.nn.Embedding(n_cats, embed_dim)
            feature['embedding'] = embed_layer

Then, when I run .forward():

        embeddings = []
        for i, ft in enumerate(self.categorical_fts):
            feature = self.categorical_fts[ft]
            emb = feature['embedding'](codes[i])
            embeddings.append(emb)

       #num and bin are numeric and binary features
        x = torch.cat(num + bin + embeddings, dim=1)

Then x goes into dense layers.

This gets the job done but running this for loop during each forward pass really slows down training, especially when a dataset has tens or hundreds of categorical columns.

Does anybody know of a way of vectorizing something like this? Thanks!

Kotzi · May 16, 2024, 6:01pm

Try to sum up the embedded vectors. And init the embedding layers near zero. I think this could potentially work like skip connections. Works for me in my autoregressive transformer which needs more features when 1 vocabulary and 1 timeembedding. It actually works for 50 different tokenizers in parallel.