How to encode variable-length categorical data so it can be fed to nn.Embedding in PyTorch

Let’s say I have a data field named movie_genre for each sample movie. Its values are selected from the following genres:

Action
Adventure
Animation
Comedy
...

Each movie may have multiple genres:

mid    genres
1      Action | Adventure
2      Animation
3      Comedy | Adventure | Action

This means each movie’s genres form a variable-length list.

If I use a one-hot vector to encode each genre, Action can be encoded as (1, 0, 0, 0), Adventure as (0, 1, 0, 0), and so on.

So the movie with mid 1 can be encoded as the multi-hot vector (1, 1, 0, 0), the movie with mid 2 as (0, 0, 1, 0), and so on.
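To make the encoding above concrete, here is a minimal sketch in plain Python. It assumes a vocabulary of only the four genres named above (the real list is longer), and the names `GENRES`, `GENRE_TO_INDEX`, and `multi_hot` are just illustrative.

```python
# Assumed vocabulary: only the four genres named in the post; the full list is longer.
GENRES = ["Action", "Adventure", "Animation", "Comedy"]
GENRE_TO_INDEX = {genre: i for i, genre in enumerate(GENRES)}


def multi_hot(genre_string):
    """Turn a string like 'Action | Adventure' into a multi-hot list."""
    vector = [0] * len(GENRES)
    for name in genre_string.split("|"):
        # Strip the spaces around each genre name before looking it up.
        vector[GENRE_TO_INDEX[name.strip()]] = 1
    return vector


# The three sample movies from the table above, keyed by mid.
movies = {
    1: "Action | Adventure",
    2: "Animation",
    3: "Comedy | Adventure | Action",
}
encoded = {mid: multi_hot(genres) for mid, genres in movies.items()}
```

Every movie gets a vector of the same length, but the result is a set of 0/1 flags rather than the indices that nn.Embedding expects.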

However, the PyTorch embedding layer nn.Embedding takes a tensor of indices as input, not a one-hot vector. So how should I encode the data so that it can be fed into the embedding layer?
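One idea I can think of is to store each movie as a list of genre indices and pad the shorter lists to a common length. The sketch below only builds the padded index lists in plain Python. Reserving index 0 for padding is my own assumption, and the helper names `encode_genres` and `pad_batch` are hypothetical.

```python
# Assumption of this sketch: index 0 is reserved for padding, so real genres start at 1.
PAD_INDEX = 0
GENRES = ["Action", "Adventure", "Animation", "Comedy"]
GENRE_TO_INDEX = {genre: i + 1 for i, genre in enumerate(GENRES)}


def encode_genres(genre_string):
    """Turn 'Action | Adventure' into a list of genre indices."""
    return [GENRE_TO_INDEX[name.strip()] for name in genre_string.split("|")]


def pad_batch(index_lists, pad_index=PAD_INDEX):
    """Right-pad every index list to the length of the longest one."""
    max_len = max(len(indices) for indices in index_lists)
    return [indices + [pad_index] * (max_len - len(indices)) for indices in index_lists]


rows = ["Action | Adventure", "Animation", "Comedy | Adventure | Action"]
padded = pad_batch([encode_genres(row) for row in rows])
```

If this is on the right track, `torch.tensor(padded)` would give an integer tensor of shape (3, 3) that nn.Embedding accepts, and creating the layer with `num_embeddings=len(GENRES) + 1` and `padding_idx=0` would keep the padding entries at the zero vector. I am not sure whether this or something like nn.EmbeddingBag is the better fit.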

Here’s the related question on Stack Overflow.
