How to work with categorical features?

ayn · May 22, 2020, 6:13am

Hi everyone! My question is about using autoencoders in PyTorch. I have a tabular dataset with a categorical feature that has 10 unique categories. The category names vary quite a bit: some consist of one word, others of two or three. What I'm trying to do is build an autoencoder that encodes these category names. For example, given a category named 'Medium size class', I want to see whether an autoencoder can be trained to encode it as something like 'mdmsc'. The goal is to find out which data points are hard to encode or atypical. I tried to adapt autoencoder architectures from various online tutorials, but nothing seems to work for me, or I simply don't know how to use them, since they are all about images. Does anyone have an idea how this type of autoencoder could be built, if it is possible at all?

Here’s the model I have so far (I just tried to adapt some architectures I found online):

import torch
import torch.nn as nn


class Autoencoder(nn.Module):
    def __init__(self, input_shape, encoding_dim):
        super(Autoencoder, self).__init__()

        # Encoder: compress the one-hot input down to encoding_dim values.
        self.encode = nn.Sequential(
            nn.Linear(input_shape, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, encoding_dim),
        )

        # Decoder: mirror of the encoder, back to the input size.
        self.decode = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, input_shape),
        )

    def forward(self, x):
        x = self.encode(x)
        x = self.decode(x)
        return x


# 10 categories in (one-hot), 5-dimensional code in the middle.
model = Autoencoder(input_shape=10, encoding_dim=5)

I also use LabelEncoder() followed by OneHotEncoder() to convert these categories to numerical form. However, after training, the output is the same as the input (the category name is unchanged), and when I try to use only the encoder part I am unable to apply LabelEncoder() and then OneHotEncoder() because of dimension issues. I feel that I could do something differently at the start, when I convert the features to numerical form, but I'm not sure what. Thanks a lot in advance!
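As a rough illustration of the setup described above, here is a minimal sketch of one-hot encoding 10 categories, training a small autoencoder on them, and then calling the encoder alone. It rests on several assumptions: apart from 'Medium size class', the category names are placeholders; the two-layer `encoder` and `decoder` are a simplified stand-in for the model in the post; and reconstruction is treated as a 10-way classification with `nn.CrossEntropyLoss`, which is one possible choice rather than the poster's method. Note that the resulting code is a vector of 5 floats, not an abbreviated string such as 'mdmsc'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical category names; only 'Medium size class' appears in the post.
categories = ["Medium size class"] + [f"category {i}" for i in range(1, 10)]
cat_to_ix = {name: i for i, name in enumerate(categories)}

# Integer indices -> one-hot rows of shape (10, 10), cast to float for nn.Linear.
indices = torch.tensor([cat_to_ix[name] for name in categories])
one_hot = F.one_hot(indices, num_classes=len(categories)).float()

# Simplified stand-in for the Autoencoder in the post (assumed layer sizes).
encoder = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 5))
decoder = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 10))

# The decoder emits logits over the 10 categories.
criterion = nn.CrossEntropyLoss()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)

for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(decoder(encoder(one_hot)), indices)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    # Encoder only: one 5-dimensional code per input row.
    codes = encoder(one_hot)
    # Per-sample reconstruction loss; larger values mean "harder to encode".
    per_sample_loss = F.cross_entropy(decoder(codes), indices, reduction="none")
```

The encoder takes the same one-hot input as the full model and returns a tensor of shape (N, 5), so there is nothing to pass back through LabelEncoder() or OneHotEncoder(); only the decoder output corresponds to category labels.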

Kushaj · May 22, 2020, 5:51pm

You can use nn.Embedding for your categorical variables. And then you can visualize the embedding to get intuitions.
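A minimal sketch of that suggestion, assuming the 10 categories have already been mapped to integer indices 0 to 9 (for example with LabelEncoder()). The embedding size of 3 and the example batch are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

num_categories = 10  # from the question
embedding_dim = 3    # assumed; a small size is easy to inspect or plot

# Lookup table with one trainable row per category.
embedding = nn.Embedding(num_categories, embedding_dim)

# A batch of category indices (hypothetical values).
batch = torch.tensor([0, 3, 3, 9])
vectors = embedding(batch)  # shape (4, 3); rows 1 and 2 are identical

# The full table, detached from the graph for plotting or inspection.
table = embedding.weight.detach()  # shape (10, 3)
```

The rows of `table` start out random and only become meaningful once the embedding has been trained as part of a model, so any visualization should be done after training.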

aleemsidra · April 2, 2022, 2:43pm

I have categorical data: sex (M, F) and age (89Y, 55Y, 45Y, 65Y). I want to pass the age and sex information to the model along with the image. When I try to visualize the embedding for the sex category, I get a different vector representation for F every time. If the vector representation for the same word keeps changing, how can it aid the prediction when concatenated with CNN-extracted features?
Below are the vectors I got for the “F” category by running the code three times:

import torch
import torch.nn as nn

torch.manual_seed(1)
word_to_ix = {"F": 0, "M": 1}
embeds = nn.Embedding(2, 3)  # 2 words in vocab, 3-dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["F"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)

print(hello_embed)

tensor([[-0.7452, -0.4845,  1.4728]], grad_fn=<EmbeddingBackward0>)
tensor([[-0.7452, -0.4845,  1.4728]], grad_fn=<EmbeddingBackward0>)
tensor([[-0.7452, -0.4845,  1.4728]], grad_fn=<EmbeddingBackward0>)
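One relevant fact here is that nn.Embedding draws its initial weights at random, so a freshly constructed layer has different values on each run unless the seed is fixed before construction. The weights are trainable parameters, so what the model ends up using is the learned table, not the initial one. The sketch below illustrates both points; the helper `make_embedding`, the feature size of 16, and the batch of 4 are hypothetical, and `torch.randn` stands in for real CNN output.

```python
import torch
import torch.nn as nn

word_to_ix = {"F": 0, "M": 1}


def make_embedding(seed):
    # Hypothetical helper: seed first, then construct the layer.
    torch.manual_seed(seed)
    return nn.Embedding(2, 3)


a = make_embedding(1)
b = make_embedding(1)
c = make_embedding(2)

same_seed_equal = torch.equal(a.weight, b.weight)  # same seed, same table
diff_seed_equal = torch.equal(a.weight, c.weight)  # different seed, different table

# Stand-in for CNN features of a batch of 4 images (assumed size 16).
image_features = torch.randn(4, 16)
sex = torch.tensor([word_to_ix[s] for s in ["F", "M", "F", "F"]])

# Concatenate along the feature dimension: (4, 16) + (4, 3) -> (4, 19).
combined = torch.cat([image_features, a(sex)], dim=1)
```

Within a single model instance the lookup is deterministic: every "F" in a batch maps to the same row, and that row is updated by backpropagation together with the rest of the network.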