How to get back the words from nn.Embedding?

I have a sentence x = ‘I love to play football’, where I = 1, love = 2, to = 3, play = 4, football = 5:

x = torch.tensor([1, 2, 3, 4, 5])
self.vocab_size = 6  # num_embeddings must be greater than the largest index used (index 0 is unused here)
self.enc_word_dim = 3
self.enc_word_vecs = nn.Embedding(self.vocab_size, self.enc_word_dim)
word_vecs = self.enc_word_vecs(x)

How can I get back ‘I love to play football’ from word_vecs? Please help!

A naive approach, if you have a predefined vocabulary (of relatively small size), would be to store the representations of all the words in that vocabulary. Given a representation, you then look for the word whose representation is closest to it (or equal to it, in the sense of a distance such as the Euclidean distance).

For example, if I have a function get_word(model: nn.Module, word: str) -> Tensor that takes a word and returns its representation, and word_to_idx is the dictionary mapping my words to their index in the vocabulary, the following function will do the trick (with the Euclidean distance):

import torch

def closest(model, vec, word_to_idx, n=10):
    """
    Finds the n closest words for a given vector.
    """
    # Euclidean distance from vec to the representation of every word in the vocabulary
    all_dists = [(w, torch.dist(vec, get_word(model, w))) for w in word_to_idx]
    # sort by distance and keep the n nearest (word, distance) pairs
    return sorted(all_dists, key=lambda t: t[1])[:n]
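For the exact setup in the question you don’t even need get_word: nn.Embedding is just a lookup table, so every row of word_vecs equals one row of enc_word_vecs.weight, and an argmin over pairwise distances recovers the original indices exactly. A minimal sketch, assuming the tensors from the question and a hypothetical idx_to_word dictionary (the inverse of word_to_idx):

import torch
import torch.nn as nn

# vocabulary from the question; index 0 is simply left unused
idx_to_word = {1: 'I', 2: 'love', 3: 'to', 4: 'play', 5: 'football'}

enc_word_vecs = nn.Embedding(6, 3)      # num_embeddings > largest index used
x = torch.tensor([1, 2, 3, 4, 5])
word_vecs = enc_word_vecs(x)            # shape (5, 3)

# pairwise Euclidean distances between each vector and every embedding row
dists = torch.cdist(word_vecs, enc_word_vecs.weight)  # shape (5, 6)
indices = dists.argmin(dim=1)           # exact matches give distance 0

print(' '.join(idx_to_word[i.item()] for i in indices))
# -> I love to play football

This exact-match shortcut only works when the query vectors come straight out of the same embedding layer; for approximate vectors (e.g. the output of a decoder), you fall back to the nearest-neighbour search above.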

I’ve done this once with GloVe, but it’s definitely not an efficient approach for large vocabularies.
