LSTM for word prediction

When dealing with text, you usually first build your own vocabulary to map between words and indexes, and vice versa. This could be a utility class like the following (a class I wrote myself):

class Vocabulary:

    def __init__(self, default_indexes={}):
        self.default_indexes = {**default_indexes}
        self.init()

    def init(self):
        self.index_to_word = {**self.default_indexes}
        self.word_to_index = {}
        self.word_counts = {}
        self.num_words = len(self.default_indexes)
        for idx, word in self.index_to_word.items():
            self.word_to_index[word] = idx

    def index_words(self, word_list):
        for word in word_list:
            self.index_word(word)

    def index_word(self, word, cnt=None):
        if word not in self.word_to_index:
            # Assign the next free index to the new word
            self.index_to_word[len(self.index_to_word)] = word
            self.word_to_index[word] = len(self.word_to_index)
            self.word_counts[word] = 1 if cnt is None else cnt
            # num_words counts distinct indexed words, so always increment by 1
            self.num_words += 1
        else:
            self.word_counts[word] += 1 if cnt is None else cnt

    def get_words(self, indices):
        return [self.index_to_word[i] if i in self.index_to_word else None for i in indices]

# Testing
vocabulary = Vocabulary(default_indexes={0: '<pad>', 1: '<unk>'})
print(vocabulary.index_to_word)  # {0: '<pad>', 1: '<unk>'}
vocabulary.index_word('test')
print(vocabulary.index_to_word)  # {0: '<pad>', 1: '<unk>', 2: 'test'}

Essentially, you now have the dictionary self.word_to_index that maps each word in your dataset to an index, e.g.:

self.word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'I': 3, 'the': 4, 'be': 5, ...}

Given a sentence “I will be tired and exhausted”, you can use this dictionary to convert it into a tensor of indexes, e.g., input = [3, 73, 5, 310, 2, 511] (maybe with padding in case of batches). Now, input is what you give to self.embedding – you do not give the embedding layer words!
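As a minimal sketch of this lookup step – the vocabulary here is a made-up toy, so the indexes differ from the example above:

```python
# Toy vocabulary for illustration only; real indexes come from your Vocabulary class
word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'I': 3, 'be': 4, 'will': 5}

def sentence_to_indexes(sentence, word_to_index, max_len):
    # Map each token to its index, falling back to <unk> for unseen words
    idxs = [word_to_index.get(w, word_to_index['<unk>']) for w in sentence.split()]
    # Pad (or truncate) to a fixed length so multiple sentences can form a batch
    return idxs[:max_len] + [word_to_index['<pad>']] * (max_len - len(idxs))

print(sentence_to_indexes('I will be tired', word_to_index, 6))
# [3, 5, 4, 1, 0, 0]  -- wrap with torch.tensor(...) before feeding the model
```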

In case you use pre-trained word embeddings, yes, you have to make sure that the embedding at position, say, 5 does indeed represent the word “be” with respect to your vocabulary. I do this using the following method – note that it takes word_to_index (i.e., your vocabulary) as an input parameter:

def create_embedding_matrix(self, embeddings_file_name, word_to_index, max_idx, sep=' ', init='zeros', print_each=10000, verbatim=False):
    # Initialize embeddings matrix to handle unknown words
    if init == 'zeros':
        embed_mat = np.zeros((max_idx + 1, self.embed_dim))
    elif init == 'random':
        embed_mat = np.random.rand(max_idx + 1, self.embed_dim)
    else:
        raise Exception('Unknown method to initialize embeddings matrix')

    with open(embeddings_file_name) as infile:
        # Run through each line in the embedding file
        # Usual layout: word coef1 coef2 coef3 ... coefN
        for idx, line in enumerate(infile):
            elem = line.split(sep)
            word = elem[0]
            # If the word is not in the vocabulary, we can skip it
            if word not in word_to_index:
                continue
            # Get index of current word given the vocabulary
            word_idx = word_to_index[word]
            # Put the pretrained word embedding into the "correct" position of your embedding matrix
            if word_idx <= max_idx:
                embed_mat[word_idx] = np.asarray(elem[1:], dtype='float32')
    # Return embedding matrix
    return embed_mat

Note: max_idx is either the largest index in your vocabulary, or a value between 0 and the largest index in case you want to restrict your vocabulary. Example usage:

embed_mat = self.create_embedding_matrix('glove.840B.300d.txt', word_to_index, max_idx)
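To make the loop above concrete, here is a self-contained toy version – the “file” contents, the two-dimensional embeddings, and the vocabulary are all made up for illustration:

```python
import numpy as np

# Toy "file" contents (the usual layout: word coef1 coef2) and a toy vocabulary
lines = ['be 0.1 0.2', 'and 0.3 0.4', 'foo 0.5 0.6']
word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'be': 3}

embed_mat = np.zeros((len(word_to_index), 2))  # rows for <pad>/<unk> stay zero
for line in lines:
    elem = line.split(' ')
    word = elem[0]
    if word not in word_to_index:  # 'foo' is not in the vocabulary, so it is skipped
        continue
    embed_mat[word_to_index[word]] = np.asarray(elem[1:], dtype='float32')

print(embed_mat[3])  # row for "be"
```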

Finally, I use embed_mat to set the weights of the embedding layer of my model:

model.embedding.weight.data.copy_(torch.from_numpy(embed_mat))
model.embedding.weight.requires_grad = not fix_embeddings

I know that nn.Embedding now has a method from_pretrained, and there’s also torchtext, which probably makes life easier, but I prefer handling these steps “on my own”. Firstly, it’s pretty straightforward, and secondly, it makes it easier to tweak and customize those steps.
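For completeness, a sketch of that built-in route – the random matrix here is just a stand-in for the embed_mat built above:

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the embed_mat built above
embed_mat = np.random.rand(10, 300).astype('float32')

# freeze=True corresponds to fix_embeddings above (sets requires_grad = False)
embedding = nn.Embedding.from_pretrained(torch.from_numpy(embed_mat), freeze=True)
```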

I hope that helps and gets you at least on the right track.