I just started with neural networks a few months ago, and I'm now playing with data in PyTorch. I learned how embeddings are used to reduce high-cardinality categorical data to a low-dimensional representation. One rule of thumb I saw for choosing the embedding size is the following formula
embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
[(69, 35), (11, 6)]
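As a sketch of where those numbers come from, the rule of thumb can be wired straight into `nn.Embedding` layers (the column names and category counts below are made up for illustration):

```python
import torch.nn as nn

# hypothetical categorical columns -> number of distinct categories each
embedded_cols = {"store_id": 69, "day_of_week": 11}

# rule of thumb: embedding dim = min(50, (n_categories + 1) // 2)
embedding_sizes = [
    (n_categories, min(50, (n_categories + 1) // 2))
    for _, n_categories in embedded_cols.items()
]
print(embedding_sizes)  # [(69, 35), (11, 6)]

# one nn.Embedding per categorical column
embeddings = nn.ModuleList(
    [nn.Embedding(n_categories, dim) for n_categories, dim in embedding_sizes]
)
```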
After creating these embedding layers, how do I know they are appropriate for an MLP? Do I compare scores with different embedding sizes, or do I visualize the layers? If visualization, what are the ways to do it? In short, how can I validate that my embedding layers are good given their reduced sizes, from 69 to 35 and from 11 to 6?
The embedding layer is just a lookup table: you pass an index and it returns an embedding vector. When you initialize the embedding layer, these vectors are just random values. After training, you can try the following to check the quality of the embeddings:
- Check the metric. With everything else kept the same, comparing the metric for different embedding dimensions (e.g. 35-dim vs. a smaller size) can give you some idea of the quality of the embeddings.
- You can use PCA to project the embeddings into 2D for visualization, although this is not a very reliable approach.
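For the PCA option, a minimal sketch using scikit-learn (the untrained `nn.Embedding` below stands in for a trained layer; in practice you'd do this after training):

```python
import torch.nn as nn
from sklearn.decomposition import PCA

# stands in for a trained embedding layer: 69 categories, 35 dims
emb = nn.Embedding(69, 35)

# embedding matrix as a NumPy array, shape (69, 35)
weights = emb.weight.detach().cpu().numpy()

# project to 2D: one (x, y) point per category, ready to scatter-plot
coords = PCA(n_components=2).fit_transform(weights)
print(coords.shape)  # (69, 2)
```

You can then scatter-plot `coords` and color or label the points by category (or by the mean of the target per category) to see whether similar categories end up near each other.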
@Kushaj thanks Kushaj. I think I need to extract those layers after training and then visualize them against the target variable. How can I extract the embedding layers' weights after training?
Use layer.weight to get the embedding matrix. In general, whenever you want to extract something from a layer in PyTorch, just look at the __init__ function in its source code to see the attribute names.
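Putting that together, here is a sketch assuming a tabular model that holds its embeddings in an `nn.ModuleList` (the model class and attribute names are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    # hypothetical model: per-column embeddings followed by a linear head
    def __init__(self, embedding_sizes):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, d) for n, d in embedding_sizes]
        )
        total_dim = sum(d for _, d in embedding_sizes)
        self.fc = nn.Linear(total_dim, 1)

    def forward(self, x_cat):
        # x_cat: (batch, n_categorical_cols) of integer category indices
        x = torch.cat(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)], dim=1
        )
        return self.fc(x)

model = TabularMLP([(69, 35), (11, 6)])
# ... train the model ...

# pull out the learned embedding matrices via .weight
matrices = [emb.weight.detach().cpu() for emb in model.embeddings]
print([tuple(m.shape) for m in matrices])  # [(69, 35), (11, 6)]
```

Each matrix row is the learned vector for one category, so row `i` of the first matrix is the embedding of category index `i` of that column.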