LSTM for word prediction

I’m having trouble with the task of predicting the next word given a sequence of words with an LSTM model. I built the embeddings with Word2Vec for my vocabulary of words taken from different books. I created a list with all the words of my books (one big flattened corpus of all of them). Then I created sequences of fixed length N (for example, sequences of 6 words) and shuffled these sequences in order to create the training, validation and test sets.

Now I’m a bit confused. Given a sentence, the network should predict each element of the sequence: if I give the sentence “The cat is on the table with Anna”, the network takes “The” and tries to predict “cat”, which is part of the sentence, so there is a ground truth, and so on.

Is this procedure correct? I don’t know how to implement it with PyTorch.
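
To make the idea concrete, this is how I picture the supervision, just as a sketch in plain Python (not actual model code):

sentence = ["The", "cat", "is", "on", "the", "table", "with", "Anna"]
inputs = sentence[:-1]    # ["The", "cat", ..., "with"]
targets = sentence[1:]    # ["cat", "is", ..., "Anna"]
for x, y in zip(inputs, targets):
    print(x, '->', y)     # The -> cat, cat -> is, is -> on, ...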

Bumping and rewording the question.

I loaded the embedding matrix into the LSTM as follows:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

weights = torch.FloatTensor(np.load('embeddings.npy'))

class lstm(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, matrix_embeddings):
        super(lstm, self).__init__()
        # dimensionalities
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

        # embedding layer, frozen to the pretrained weights
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.weight.requires_grad = False

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

Now I have some questions. First question: how should I proceed? How does the LSTM know that the word “Home” corresponds to index 787 of the embedding matrix?
I don’t understand how to build an LSTM network for word prediction and I need your help, guys.

When dealing with texts, you usually first build your own Vocabulary to map between words and indexes, and vice versa. This could be a utility class like the following (this is a class I wrote myself):

class Vocabulary:

    def __init__(self, default_indexes={}):
        # default_indexes reserves fixed indexes for special tokens, e.g. {0: '<pad>', 1: '<unk>'}
        self.default_indexes = {**default_indexes}
        self.init()

    def init(self):
        self.index_to_word = {**self.default_indexes}
        self.word_to_index = {}
        self.word_counts = {}
        self.num_words = len(self.default_indexes)
        for idx, word in self.index_to_word.items():
            self.word_to_index[word] = idx

    def index_words(self, word_list):
        # Index a whole list of words, e.g. a tokenized corpus
        for word in word_list:
            self.index_word(word)

    def index_word(self, word, cnt=None):
        # Add a new word at the next free index, or update the count of a known word
        if word not in self.word_to_index:
            self.index_to_word[len(self.index_to_word)] = word
            self.word_to_index[word] = len(self.word_to_index)
            if cnt is None:
                self.word_counts[word] = 1
                self.num_words += 1
            else:
                self.word_counts[word] = cnt
                self.num_words += cnt
        else:
            if cnt is None:
                self.word_counts[word] += 1
            else:
                self.word_counts[word] += cnt

    def get_words(self, indices):
        # Map indexes back to words; unknown indexes become None
        return [self.index_to_word[i] if i in self.index_to_word else None for i in indices]

# Testing
vocabulary = Vocabulary(default_indexes={0: '<pad>', 1: '<unk>'})
print(vocabulary.index_to_word)  # {0: '<pad>', 1: '<unk>'}
vocabulary.index_word('test')
print(vocabulary.index_to_word)  # {0: '<pad>', 1: '<unk>', 2: 'test'}

Essentially, you now have the dictionary self.word_to_index that maps a word in your dataset to an index, e.g.:

self.word_to_index = {'<pad>': 0, '<unk>': 1, 'and': 2, 'I': 3, 'the': 4, 'be': 5, ...}

Given a sentence “I will be tired and exhausted”, you can use this dictionary to convert the sentence into a tensor of indexes, e.g., input = [3, 73, 5, 310, 2, 511] (maybe with padding in case of batches). Now, input is what you give to self.embeddings – you do not give the embedding layer words!
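
For example, a minimal sketch using the Vocabulary class above (assuming it has already indexed your corpus; the concrete indexes will of course differ):

import torch

sentence = "I will be tired and exhausted".split()
unk_idx = vocabulary.word_to_index['<unk>']
indexes = [vocabulary.word_to_index.get(word, unk_idx) for word in sentence]
input_tensor = torch.tensor(indexes, dtype=torch.long)  # e.g. tensor([3, 73, 5, 310, 2, 511])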

In case you use pre-trained word embeddings, yes, you have to make sure that the embedding at position, say, 5 does indeed represent the word “be” with respect to your vocabulary. I do this using the following method – note that this method takes word_to_index (i.e., your vocabulary) as an input parameter:

def create_embedding_matrix(self, embeddings_file_name, word_to_index, max_idx, sep=' ', init='zeros', print_each=10000, verbatim=False):
    # Initialize embeddings matrix to handle unknown words
    if init == 'zeros':
        embed_mat = np.zeros((max_idx + 1, self.embed_dim))
    elif init == 'random':
        embed_mat = np.random.rand(max_idx + 1, self.embed_dim)
    else:
        raise Exception('Unknown method to initialize embeddings matrix')

    with open(embeddings_file_name) as infile:
        # Run through each line in the embedding file
        # Usual layout: word coef1 coef2 coef3 ... coefN
        for idx, line in enumerate(infile):
            elem = line.split(sep)
            word = elem[0]
            # If the word is not in the vocabulary, we can skip it
            if word not in word_to_index:
                continue
            # Get index of current word given the vocabulary
            word_idx = word_to_index[word]
            # Put the pretrained word embedding into the "correct" position of your embedding matrix
            if word_idx <= max_idx:
                embed_mat[word_idx] = np.asarray(elem[1:], dtype='float32')
    # Return embedding matrix
    return embed_mat

Note: max_idx is either the largest index in your vocabulary, or a value between 0 and the largest index in case you want to restrict your vocabulary. Example usage:

embed_mat = create_embedding_matrix('glove.840B.300d.txt', word_to_index, max_idx)

Finally, I use embed_mat to set the weights of the embedding layer of my model:

model.embedding.weight.data.copy_(torch.from_numpy(embed_mat))
if fix_embeddings:
    model.embedding.weight.requires_grad=False
else:
    model.embedding.weight.requires_grad=True

I know that nn.Embedding now has a from_pretrained method, and there’s also torchtext, which probably makes life easier, but I prefer handling these steps “on my own”. Firstly, it’s pretty straightforward, and secondly, it makes it easier to tweak and customize those steps.
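
For completeness, the from_pretrained route would look roughly like this (just a sketch, assuming embed_mat built as above):

import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(embed_mat).float(),
    freeze=True,  # same effect as setting requires_grad = False by hand
)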

I hope that helps and gets you at least on the right track.

Thank you for your answer. I actually found my own solution to produce the vector of word indexes, like your input tensor. I processed 8 different books, created a flat corpus of all of them, and then decided to create fixed sentences of 8 words each. I shuffled these sentences in order to create the training, validation and test sets. I built a method which takes a sentence and converts it into a tensor of indexes, which is passed to the LSTM, but I’m missing a crucial point. Suppose that the sentence is “The dog is on the table”, which converts to [10, 0, 34, 983, 8, 12]. Then I pass this vector to the LSTM. My objective is to predict, for the word “The”, the word “dog”; for the word “dog”, the word “is”; and so on. Is there an efficient way to do this? Do I have to create all the pairs explicitly? What is the output of the LSTM? I’m missing these concepts.

Also, I noticed that the LSTM gives me an output tensor with a length equal to the length of the input tensor, while I’m expecting an output of length len(input) - 1, since the last word does not have a ground truth.

Also, how long does each output tensor need to be? At the moment each output tensor has a length equal to the vocabulary size, because I thought I would use softmax to select the index with the highest probability. Is that correct?

Final question: I obtain tensors of different lengths because some words are not present in my vocabulary. What is the best solution for having tensors of the same length?
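
To make these questions concrete, this is roughly the computation I’m picturing, though I’m not sure the shapes are right (model here stands for whatever my LSTM module ends up being, returning log-probabilities over the vocabulary for every position):

import torch
import torch.nn.functional as F

seq = torch.tensor([10, 0, 34, 983, 8, 12])  # "The dog is on the table"
inp, target = seq[:-1], seq[1:]              # predict the next word at every position

scores = model(inp)                          # hoped-for shape: [len(inp), vocab_size]
loss = F.nll_loss(scores, target)            # assumes scores are log_softmax outputs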

UPDATE

I found a temporary solution for the previous problems. Now I’m feeding the LSTM with sequences of fixed length 7. Using batch_size = 256 I give the model an input tensor of size [256, 7]. For simplicity I’m using a batch_size of 1 for the moment, so I have [1, 7]. Each of my embeddings has a length of 200, but when I give as input, for example, a tensor like [1, 0, 1, 89, 177, 7, 7], which contains the indexes of the words in the embedding matrix, it gives me the following error:

Expected target size (1, 200), got torch.Size([1, 7])

I set input_size = 200 and layers_num = 7 because I need to predict seven words sequentially, but I think I misunderstood the meaning of these parameters.

My current LSTM

class lstm(nn.Module):

    def __init__(self, input_size, hidden_units, layers_num, matrix_embeddings, dropout_prob=0):
        
        super().__init__()
        #embedding
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.weight.requires_grad = False

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.rnn = nn.LSTM(input_size=input_size, 
                           hidden_size=hidden_units,
                           num_layers=layers_num,
                           dropout=dropout_prob,
                           batch_first=True)

        # The linear layer that maps from hidden state space to tag space
        self.out = nn.Linear(hidden_units, input_size)

    def forward(self, sentence, state=None):
        embeds = self.embeddings(sentence)
        # LSTM
        x, rnn_state = self.rnn(embeds, state)
        # Linear layer
        x = self.out(x)
        return x, rnn_state

UPDATE: The network is working, in the sense that it does not give me other types of errors, but I have another big problem: after 1000-1500 epochs it starts overfitting. I’m using 2 hidden layers with 128 units, inputs are sentences of 30 words each, word embeddings of 64 dimensions, and the five Game of Thrones books as training data.
If, for example, the seed after training is “Hodor is”, the LSTM returns something like “Hodor is a man of the night watch i am not a man of the night watch i am not a man of the night watch i am not a man of the night watch…”, with the same phrase repeated over and over.

Any suggestions about what the problem might be?

The word language modeling link is a relevant example for predicting the next word.
To build a vocab over multiple books, yes, you are right to put the sentences together into one corpus. If you like, you could also use the vocab class in torchtext. An example of building a vocab based on Wikipedia is here.
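
Roughly, the core trick in that example is to carve the flat stream of token ids into (input, target) chunks where the target is simply the input shifted by one position; a simplified sketch (the real example also reshapes the stream into batches first):

import torch

def get_batch(stream, i, bptt=30):
    # stream: 1-D tensor of token ids for the whole corpus
    seq_len = min(bptt, len(stream) - 1 - i)
    data = stream[i:i + seq_len]            # words at positions i .. i+seq_len-1
    target = stream[i + 1:i + 1 + seq_len]  # the next word at every position
    return data, target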

It was a problem with the weight decay factor and the learning rate; it works now!