Can anybody explain the importance of Embedding for Categorical Variables?

I’m new to PyTorch. I’m working in ANN project and using PyTorch in this project. I didn’t understand the concept of using Embedding for categorical variables. Is it similar to get the dummy variables from Pandas?

What do you mean by categorical variables?

The categorical variables which has ordinal values.

Is it target value for classification task?

Nope. It’s for regression problem. I have some categorical variables in my datasets. In ML, I usually deal with Pandas One hot encoding. But in PyTorch, I cam across embedding concept.

I’ve never seen this before.
Usually embedding layer is used for embedding token’s (encoded words) as it maps it to matrix [sentence_length, embedding_dim] from [sentence_length] and this mapping supposed to have some meaning like king-man+woman=quin in term of vectors.

Dear @kumareshbabuns,
Dummy variables and embeddings (or word embeddings) are two different things. Both are vector representations for categorical variables. The former is a sparse representation where only one of the values of each vector representation is 1 rest being 0. 'Embeddings" are a dense vector representation for categorical variables or words, learned using some other neural network.

One way to understand this in practice could be, say you have 3 categorical variables - a, b and c in your features. Using one hot encoding or pandas dummies you can get a vector representation for a,b, c as [1,0,0] , [0,1,0] , [0,0,1] respectively. Since, there are 3 categories, the size of the one hot encoded vectors is equal to 3, with one value being 1. Imagine you have 10000 unique categorical features, each category will then have to be represented using a vector of size 10000 (hence, sparse).

In this case using ‘Embeddings’ could be useful, as you can then learn a dense vector representation (after adequate training) of any size, for each categorical feature. This is particularly the case while dealing with text data as usually high number of unique words exists in a corpus. Therefore, using pre-trained word embeddings or training word embeddings on the fly (using embedding layer) is useful. Remember, to do this you will need to feed the network with a lot of text which contains the words that exist in the dataset. For e.g. to get a general representation of the word ‘sun’ you will need a lot of sentences that contain the word ‘sun’.

In a scenario where the input is a mix of categorical and numerical features, one can either use pre trained word embeddings for categories and concatenate them with the numerical features. E.g. if one input sample is -> [1,20, a, b], you can either pre process the input and replace ‘a’ and ‘b’ with the pandas dummies. In this case the input will look like - [1, 20, 1,0 , 0,1] (considering only ‘a’ and ‘b’ exists).
OR
Use an embedding layer for processing ‘a’ and ‘b’ part of the input and concatenate the output of embedding layer with [1,20]. Therefore the input become - [1, 20, <embedding for a>, <embedding for b>]

Hope this helps.
Thank you

3 Likes

Hi @anksng , I am sorry for responding so late but i found your comment and it is exactly what I am looking for! I wish to execute your last example (taking away continuous features and concatenating the categorical ones later), but I don’t know where to start. Would you happen to know a tutorial or a good way to perform this embedding treatment of the categorical features? I currently have a dataset of 27 features, of which 20 are categorical. I wish to extract the 20 categorical features, treat them with embedding, and them add them back to the dataset when done…

I hope you can help me!