Autoencoder in PyTorch to encode features/categories of data

ayn · May 21, 2020, 5:50pm

My question is about using autoencoders in PyTorch. I have a tabular dataset with a categorical feature that has 10 unique categories. The category names vary quite a bit: some consist of one word, others of two or three words. What I’m trying to do is create an autoencoder that encodes the names of these categories. For example, if I have a category named ‘Medium size class’, I want to see whether an autoencoder can be trained to encode this name as something like ‘mdmsc’. The goal would be to find out which data points are hard to encode or atypical. I tried to adapt autoencoder architectures from various online tutorials, but nothing seems to work for me, or I simply do not know how to use them, as they are all about images. Does anyone have an idea how this type of autoencoder could be built, if it is possible at all? Or can someone suggest a tutorial to follow?

Thank you in advance!

ptrblck · May 23, 2020, 7:04am

If I understand the use case correctly, you are dealing with names that each correspond to a category, i.e. each name has a unique label.
You would like to pass these names (as labels or in some other representation) into your model and get the abbreviation of each name as the latent tensor?

ayn · May 23, 2020, 7:23am

Yes, exactly. I tried using LabelEncoder() and OneHotEncoder() to convert these names into numerical form and pass them into a simple autoencoder model. However, I’m not sure it’s the right approach, as I can’t get the output I want.
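
For reference, here is a minimal sketch of that preprocessing step written with PyTorch only, as a stand-in for scikit-learn's LabelEncoder() and OneHotEncoder(). The category names other than 'Medium size class' are made up for illustration, and `name_to_idx` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

# Hypothetical category names; only 'Medium size class' comes from the thread.
names = ["Medium size class", "Small", "Large class"]

# Label encoding: map each unique name to an integer index.
name_to_idx = {name: i for i, name in enumerate(sorted(set(names)))}
labels = torch.tensor([name_to_idx[n] for n in names])

# One-hot encoding: one row per sample, one column per category.
one_hot = F.one_hot(labels, num_classes=len(name_to_idx)).float()

print(labels)         # integer label for each name
print(one_hot.shape)  # (number of samples, number of categories)
```

Note that a one-hot vector only says which category a row belongs to. It carries no information about the letters in the name, so a latent code learned from it cannot resemble an abbreviation such as 'mdmsc'.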

ayn · May 24, 2020, 7:23am

Here is what I have so far for my model; however, I am not sure it suits the task. After training, the output is the same as the input, and I am not sure what I should fix. Could you suggest what I should change in my model, or how I should preprocess my categorical data? Thanks in advance!

import torch.nn as nn


class Autoencoder(nn.Module):

    def __init__(self, input_shape, encoding_dim):
        super(Autoencoder, self).__init__()

        # Encoder: compress the input down to encoding_dim values.
        self.encode = nn.Sequential(
            nn.Linear(input_shape, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, encoding_dim),
        )

        # Decoder: reconstruct the input from the latent code.
        self.decode = nn.Sequential(
            nn.Linear(encoding_dim, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, input_shape)
        )

    def forward(self, x):
        x = self.encode(x)
        x = self.decode(x)
        return x


# 10 one-hot inputs (one per category), compressed to a 5-dimensional code.
model = Autoencoder(input_shape=10, encoding_dim=5)
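
Getting the input back as the output is what an autoencoder is trained to do; the compressed representation is the output of `encode`, not of `forward`. The following is a rough sketch of how the latent codes and a per-sample reconstruction error could be read out. It uses a smaller stand-in for the model above, and the optimizer, learning rate, loss, and epoch count are assumptions rather than anything taken from the thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class Autoencoder(nn.Module):
    """Smaller stand-in for the model in the post (assumed layer sizes)."""

    def __init__(self, input_shape, encoding_dim):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(input_shape, 64), nn.ReLU(),
            nn.Linear(64, encoding_dim),
        )
        self.decode = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, input_shape),
        )

    def forward(self, x):
        return self.decode(self.encode(x))


# One one-hot row per category (10 categories, as in the post).
x = torch.eye(10)

model = Autoencoder(input_shape=10, encoding_dim=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed settings
criterion = nn.MSELoss()

# Train the model to reconstruct its own input.
for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), x)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    codes = model.encode(x)                      # latent code per sample
    errors = ((model(x) - x) ** 2).mean(dim=1)   # reconstruction error per sample

print(codes.shape, errors.shape)
```

Samples with an unusually large value in `errors` would be the ones that are "hard to encode". With only 10 distinct one-hot inputs, though, the model can memorize all of them, so the errors may not be very informative.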

ptrblck · May 24, 2020, 7:31am

Unfortunately, I don’t know how you could approach this problem.
The desired target sounds like a sequential task, i.e. you would pass in a sequence of letters and expect your model to output another sequence (the abbreviation of the input), which would be similar to e.g. a translation use case.
However, I’m not sure how this could be combined with an autoencoder-like architecture.
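
As a rough illustration of the sequential view, the sketch below turns each name into a padded sequence of character indices and runs it through a small GRU-based sequence autoencoder. This is only one possible way to combine the two ideas, not an established recipe: the class `CharSeqAutoencoder`, the layer sizes, and all names except 'Medium size class' are assumptions. The latent here is a fixed-size vector, not a readable abbreviation.

```python
import torch
import torch.nn as nn

# Hypothetical category names; only 'Medium size class' comes from the thread.
names = ["Medium size class", "Small", "Large class"]

PAD = 0  # index reserved for padding
chars = sorted(set("".join(names)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}
vocab_size = len(chars) + 1
max_len = max(len(n) for n in names)


def encode_name(name):
    """Map a name to character indices, right-padded to max_len."""
    ids = [char_to_idx[c] for c in name]
    return ids + [PAD] * (max_len - len(ids))


tokens = torch.tensor([encode_name(n) for n in names])  # (num_names, max_len)


class CharSeqAutoencoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # Encoder: the final hidden state summarizes the whole name.
        _, hidden = self.encoder(self.embed(tokens))
        latent = hidden.squeeze(0)                       # (batch, hidden_dim)
        # Decoder: feed the latent vector at every time step.
        steps = latent.unsqueeze(1).repeat(1, tokens.size(1), 1)
        decoded, _ = self.decoder(steps)
        return self.out(decoded), latent                 # logits per character


model = CharSeqAutoencoder(vocab_size)
logits, latent = model(tokens)

# Reconstruction loss over characters, ignoring padded positions.
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(logits.reshape(-1, vocab_size), tokens.reshape(-1))
print(tokens.shape, logits.shape, latent.shape, loss.item())
```

Training this with an optimizer would follow the usual loop; the per-name reconstruction loss could then serve as a measure of how hard a name is to encode.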