Hello everybody!
I’m working on a torch-based library for building autoencoders with tabular datasets.
One big feature is learning embeddings for categorical features.
In practice, however, training many embedding layers simultaneously is costing me some slowdowns. I am using for-loops to do this and running the for-loop on each iteration is (I think) what’s causing the slowdowns.
When building the model, I associate embedding layers with each categorical feature in the user’s dataset:
for ft in self.categorical_fts:
feature = self.categorical_fts[ft]
n_cats = len(feature['cats']) + 1
embed_dim = compute_embedding_size(n_cats)
embed_layer = torch.nn.Embedding(n_cats, embed_dim)
feature['embedding'] = embed_layer
Then, when I run .forward():
embeddings = []
for i, ft in enumerate(self.categorical_fts):
feature = self.categorical_fts[ft]
emb = feature['embedding'](codes[i])
embeddings.append(emb)
#num and bin are numeric and binary features
x = torch.cat(num + bin + embeddings, dim=1)
Then x
goes into dense layers.
This gets the job done but running this for loop during each forward pass really slows down training, especially when a dataset has tens or hundreds of categorical columns.
Does anybody know of a way of vectorizing something like this? Thanks!