nn.Embedding layer in categorical features

erkinovic · April 7, 2022, 12:02pm

Hi everyone,

These forums have helped me greatly but unfortunately i can’t find a problem similar to the one i have run into now.

For context, I am creating a VAE in order to learn data distribution and create a mock dataset with similar statistical properties of the original dataset (synthetic data generation). Now I am working with a heavily categorical value based dataset (20 out of 27 variables are categorical) and I have learned I can use nn.Embedding to deal with these categorical values. I have tried one-hot-encoding them all but this resulted in an explosion of features (resulted in over 120 features which made my data very sparse).

However, I have noticed that nn.Embedding is primarily used in NLP and vocabularies with a set amount of words. Now I am dealing with features that all have different “vocabularies” AKA amount of categories and I am wondering on how I can correctly implement the nn.Embedding layer and correctly let the network know which features are categorical and should not be treated continuously.

Example: one feature has 2 categories (binary) and one has 10 categories (different positions in a company). Is there a way to satisfy all category sizes with nn.Embedding?

Thank you in advance!