Can anybody explain the importance of Embedding for Categorical Variables?

kumareshbabuns · April 10, 2020, 5:13pm

I’m new to PyTorch. I’m working in ANN project and using PyTorch in this project. I didn’t understand the concept of using Embedding for categorical variables. Is it similar to get the dummy variables from Pandas?

jubick · April 10, 2020, 5:27pm

What do you mean by categorical variables?

kumareshbabuns · April 10, 2020, 5:31pm

The categorical variables which has ordinal values.

jubick · April 10, 2020, 5:33pm

Is it target value for classification task?

kumareshbabuns · April 10, 2020, 5:46pm

Nope. It’s for regression problem. I have some categorical variables in my datasets. In ML, I usually deal with Pandas One hot encoding. But in PyTorch, I cam across embedding concept.

jubick · April 10, 2020, 5:56pm

I’ve never seen this before.
Usually embedding layer is used for embedding token’s (encoded words) as it maps it to matrix [sentence_length, embedding_dim] from [sentence_length] and this mapping supposed to have some meaning like king-man+woman=quin in term of vectors.

anksng · April 10, 2020, 10:38pm

Dear @kumareshbabuns,
Dummy variables and embeddings (or word embeddings) are two different things. Both are vector representations for categorical variables. The former is a sparse representation where only one of the values of each vector representation is 1 rest being 0. 'Embeddings" are a dense vector representation for categorical variables or words, learned using some other neural network.

One way to understand this in practice could be, say you have 3 categorical variables - a, b and c in your features. Using one hot encoding or pandas dummies you can get a vector representation for a,b, c as [1,0,0] , [0,1,0] , [0,0,1] respectively. Since, there are 3 categories, the size of the one hot encoded vectors is equal to 3, with one value being 1. Imagine you have 10000 unique categorical features, each category will then have to be represented using a vector of size 10000 (hence, sparse).

In this case using ‘Embeddings’ could be useful, as you can then learn a dense vector representation (after adequate training) of any size, for each categorical feature. This is particularly the case while dealing with text data as usually high number of unique words exists in a corpus. Therefore, using pre-trained word embeddings or training word embeddings on the fly (using embedding layer) is useful. Remember, to do this you will need to feed the network with a lot of text which contains the words that exist in the dataset. For e.g. to get a general representation of the word ‘sun’ you will need a lot of sentences that contain the word ‘sun’.

In a scenario where the input is a mix of categorical and numerical features, one can either use pre trained word embeddings for categories and concatenate them with the numerical features. E.g. if one input sample is -> [1,20, a, b], you can either pre process the input and replace ‘a’ and ‘b’ with the pandas dummies. In this case the input will look like - [1, 20, 1,0 , 0,1] (considering only ‘a’ and ‘b’ exists).
OR
Use an embedding layer for processing ‘a’ and ‘b’ part of the input and concatenate the output of embedding layer with [1,20]. Therefore the input become - [1, 20, <embedding for a>, <embedding for b>]

Hope this helps.
Thank you

erkinovic · April 7, 2022, 2:10pm

Hi @anksng , I am sorry for responding so late but i found your comment and it is exactly what I am looking for! I wish to execute your last example (taking away continuous features and concatenating the categorical ones later), but I don’t know where to start. Would you happen to know a tutorial or a good way to perform this embedding treatment of the categorical features? I currently have a dataset of 27 features, of which 20 are categorical. I wish to extract the 20 categorical features, treat them with embedding, and them add them back to the dataset when done…

I hope you can help me!

Zeppelin_Page · January 12, 2023, 9:43am

Hi Erkinovic

Probably too late to reply. Even I have found this topic very confusing. So I took some time to come up with a blog post to explain how one can use categorical features and create an embedding layer out of it. Possibly, it can also be combined with continuous features as well

There is a Colab Notebook as well to experiment with this concept.

Hope the community benefits from this!!