Word2Vec as input to LSTM

Hi all. I am new to PyTorch and would like your help. I have created a word2vec model of a corpus using gensim’s Word2Vec. Now I want to feed this model into a bidirectional LSTM. How do I proceed? And what exactly is the nn.Embedding layer? Do I need it?

You essentially always need an embedding layer to map words to vector representations: the network doesn’t understand words, only vectors/tensors of numbers. Pretrained word embeddings such as Word2Vec or GloVe ensure that these vector representations already carry semantic meaning before the network is ever trained.

When creating an LSTM network the first layer is usually something like:

self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

You can then, before training, set the weights of the embedding layer to your word vectors like this:

# The shape of embeddings_matrix must be (vocab_size, embedding_dim)
model.word_embeddings.weight.data.copy_(torch.from_numpy(embeddings_matrix))

# Make sure that the weights in the embedding layer are not updated
model.word_embeddings.weight.requires_grad=False
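
Put together, the two steps above can be sketched as follows. This is only a minimal illustration under assumptions: the tiny sizes and the random embeddings_matrix are placeholders for real word vectors, and a bare nn.Embedding stands in for model.word_embeddings.

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 4  # placeholder sizes for illustration

# Stand-in for real pretrained vectors; shape must be (vocab_size, embedding_dim)
embeddings_matrix = np.random.rand(vocab_size, embedding_dim).astype(np.float32)

word_embeddings = nn.Embedding(vocab_size, embedding_dim)

# Copy the pretrained vectors into the layer without tracking gradients
with torch.no_grad():
    word_embeddings.weight.copy_(torch.from_numpy(embeddings_matrix))

# Freeze the layer so the optimizer does not update the vectors
word_embeddings.weight.requires_grad = False

# Looking up index 3 returns row 3 of the matrix
looked_up = word_embeddings(torch.tensor([3]))
```

If the vectors should be fine-tuned along with the rest of the network, leave requires_grad at its default of True.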

The main task is to create embeddings_matrix correctly. In my code, I use GloVe word vectors I load from a file. The following snippet just shows the rough idea since all this stuff is custom code.

word_vector_loader = WordVectorLoader()  # custom helper class (not shown)
word_vector_loader.load_glove('glove.6B.100d.txt') # embedding_dim=100
embeddings_matrix = word_vector_loader.generate_embedding_matrix(vectorizer.vocabulary.word_to_index, 100, max_idx)
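
Since WordVectorLoader is custom code that is not shown, here is a rough sketch of what building such a matrix could look like. The function build_embedding_matrix, the toy vocabulary, and the choice to leave words without a pretrained vector as all-zero rows are assumptions made for illustration, not the actual implementation.

```python
import numpy as np

def build_embedding_matrix(word_to_index, word_vectors, embedding_dim):
    """Return a (max_index + 1, embedding_dim) matrix whose row i holds
    the vector of the word that is mapped to index i."""
    max_idx = max(word_to_index.values())
    matrix = np.zeros((max_idx + 1, embedding_dim), dtype=np.float32)
    for word, idx in word_to_index.items():
        vector = word_vectors.get(word)
        if vector is not None:  # words without a pretrained vector stay all-zero
            matrix[idx] = vector
    return matrix

# Toy vocabulary and toy "pretrained" vectors (hypothetical data)
word_to_index = {"<pad>": 0, "i": 1, "go": 2, "home": 3}
word_vectors = {"i": np.ones(3), "go": np.full(3, 2.0)}

embeddings_matrix = build_embedding_matrix(word_to_index, word_vectors, 3)
```

Another common choice is to initialize missing words with small random values instead of zeros.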

Check out torchtext, which might make all of this much easier; at the very least it provides pretrained word vectors. If you insist on using your own, you will probably have to prepare them yourself to serve as weights for the embedding layer.

It’s a bit difficult to be more helpful without seeing any code; I just copied the snippets from one of my projects.


Thank you Chris for the help, especially the requires_grad part.
So I have made some changes and these are the steps I followed:
1: model.save('w2v.model') # which persists the word2vec model I created using gensim
2: model = Word2Vec.load('w2v.model') # loading the model
3:

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)

Do these steps seem correct? (I haven’t added the requires_grad part yet.) The w2v dimension is 200.
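
As a side note on the requires_grad part: nn.Embedding.from_pretrained takes a freeze argument that defaults to True, so the weights are already excluded from training. A small sketch, where a random NumPy array stands in for model.wv.vectors (an assumption so that the example runs without a trained gensim model):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for model.wv.vectors: a NumPy array of shape (vocab_size, 200)
vectors = np.random.rand(50, 200).astype(np.float32)

weights = torch.FloatTensor(vectors)
embedding = nn.Embedding.from_pretrained(weights)  # freeze=True by default

# The pretrained weights are frozen without setting requires_grad by hand
frozen = not embedding.weight.requires_grad

# Pass freeze=False instead to fine-tune the vectors during training
trainable = nn.Embedding.from_pretrained(weights, freeze=False)
```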


I cannot test the code but it looks alright.

You only have to make sure that the indices in your input sequences match the rows of the embedding matrix. For example, if you convert the sentence “i go to work every day” into the input sequence [4, 24, 8, 120, 53, 78, 0, 0, 0, 0], with 0 representing padding, so that 4 represents “i”, 24 represents “go”, and so on, then the embedding matrix must agree: the row at index 4 must be the word vector for “i”. See this other post discussing that issue.
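
One way to keep indices and rows aligned while reserving 0 for padding is sketched below. The word list, the random vectors, and the encode helper are made up for illustration; with real pretrained vectors, words would be the vocabulary in the same order as the rows of the vector matrix.

```python
import numpy as np

PAD_INDEX = 0

# Stand-in for pretrained data: words in the row order of the vector matrix
words = ["i", "go", "to", "work", "every", "day"]
vectors = np.random.rand(len(words), 5).astype(np.float32)

# Shift every word by one so that index 0 is free for padding
word_to_index = {word: i + 1 for i, word in enumerate(words)}

# Prepend an all-zero row so that row 0 of the matrix is the padding vector
pad_row = np.zeros((1, vectors.shape[1]), dtype=np.float32)
embeddings_matrix = np.vstack([pad_row, vectors])

def encode(sentence, max_len):
    """Map a sentence to word indices, truncated or padded to max_len."""
    indices = [word_to_index[w] for w in sentence.lower().split()]
    indices = indices[:max_len]
    return indices + [PAD_INDEX] * (max_len - len(indices))

sequence = encode("i go to work every day", max_len=10)
```

Words that are missing from the vocabulary raise a KeyError in this sketch; a real pipeline would usually map them to a dedicated unknown-word index.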


Hi Chris,
I followed your instructions but ran into a problem that I couldn’t find a solution for. I hope you can help me.

So basically my code looks like this:

w2v_model = gensim.models.Word2Vec.load('./model/word2vec.model')
w2v_weights = w2v_model.wv.vectors

so the weights have shape (81505, 100),

and I defined my model like this:

class LSTM(nn.Module):
    """docstring for LSTM"""
    def __init__(self):
        super(LSTM, self).__init__()
        self.word_embeddings = nn.Embedding(81505, 100)
        self.lstm = nn.LSTM(500, 300,2,bidirectional = True)
        self.dropout = nn.Dropout(0.5)
        self.dense = nn.Linear(200, 5)
        self.act = nn.ReLU()

    def forward(self,x):
        text_emb = self.word_embeddings(x)
        lstm_out, lstm_hidden = self.lstm(text_emb)
        lstm_out = lstm_out[:,-1,:]
        lstm_out = self.act(lstm_out)
        drop_out = self.dropout(lstm_out)
        output = self.dense(drop_out)
        return output

But then, after starting the training, I got this error:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

Sorry if my question is too obvious. I’m really new to NLP.
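
For context, one common cause of "IndexError: index out of range in self" from torch.embedding is an input index that is negative or not smaller than the num_embeddings of the layer; whether that is the cause here cannot be told from the post. A minimal sketch that reproduces the error on the CPU and checks a batch beforehand (check_indices is a hypothetical helper, not a PyTorch function):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)  # valid indices: 0..4

def check_indices(batch, layer):
    """Raise a readable error if any index falls outside the embedding table."""
    lo, hi = int(batch.min()), int(batch.max())
    if lo < 0 or hi >= layer.num_embeddings:
        raise ValueError(
            f"indices span [{lo}, {hi}], but the layer only has rows "
            f"0..{layer.num_embeddings - 1}"
        )

good = torch.tensor([[0, 4, 2]])
bad = torch.tensor([[0, 5, 2]])  # 5 is out of range for 5 rows

check_indices(good, embedding)  # passes silently

try:
    embedding(bad)
    raised = False
except IndexError:
    raised = True
```

Calling check_indices on each batch before the forward pass turns the opaque error into a message that names the offending index range.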

Hm, I’m not sure how to interpret the error. It seems to come from inside the torch package? If your aim is to get your code running, I would try the following to simplify the model, which makes debugging easier:

  • Name your class MyLSTM. It’s probably not an issue, but there’s no need to risk naming problems :).

  • Use just one unidirectional layer: self.lstm = nn.LSTM(500, 300,1,bidirectional=False). Otherwise lstm_out = lstm_out[:,-1,:] might yield unexpected results.

  • Remove (comment out) self.dropout and self.act. nn.LSTM already uses activation functions internally, and dropout is not needed to get the model working, “just” to improve the results.

  • Since you don’t show that code, I’m not sure whether you set the pretrained word vectors. Anyway, again, they are not needed to get a working model for a start.

  • The hidden_dim of your nn.LSTM is 300, but you define nn.Linear(200, 5). How do you get from 300 dimensions to 200 for the linear layer?

  • Can you change your forward() method to the following?

def forward(self,x):
    print(x.shape)
    text_emb = self.word_embeddings(x)
    print(text_emb.shape)
    lstm_out, lstm_hidden = self.lstm(text_emb)
    lstm_out = lstm_out[:,-1,:]
    print(lstm_out.shape)
    output = self.dense(lstm_out)  # drop_out no longer exists once dropout is removed
    return output
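
To make the dimension questions above concrete, here is a sketch of a model whose layer sizes are consistent with each other. It is one possible arrangement, not necessarily what the original code intended: the constructor arguments are introduced for illustration, and batch_first=True is an assumption so that the input is laid out as (batch, seq_len) and lstm_out[:, -1, :] really selects the last time step.

```python
import torch
import torch.nn as nn

class MyLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # input_size of the LSTM must equal embedding_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        # A bidirectional LSTM concatenates both directions: 2 * hidden_dim
        self.dense = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        text_emb = self.word_embeddings(x)   # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(text_emb)    # (batch, seq_len, 2 * hidden_dim)
        last_step = lstm_out[:, -1, :]       # (batch, 2 * hidden_dim)
        return self.dense(last_step)         # (batch, num_classes)

# Small sizes so the example runs quickly
model = MyLSTM(vocab_size=50, embedding_dim=8, hidden_dim=16, num_classes=5)
batch = torch.randint(0, 50, (4, 7))         # 4 sequences of 7 word indices
output = model(batch)
```

Note that at the last time step the backward direction of a bidirectional LSTM has only seen the final token, which is one reason the slice can give unexpected results; concatenating the final hidden states of both directions is a common alternative.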

Thank you for your answer.

I think I may have already used the word2vec weights to convert my words before feeding them to the LSTM. So I believe my input is already embedded, and I don’t need the embedding layer anymore? I’m not sure whether that is correct.

Here is my code to convert the data before passing it to the model:

x = []
for sentence in txt_sequence:
    temp = []
    for word in sentence:
        temp.append(w2v_model.wv[word])  # this will extract the vector of the word from the w2v weights
    x.append(temp)

After that, I pad x with zeros using the padding utility provided by Keras. So I removed the embedding layers and everything works. But I’m not sure whether this is the most efficient way.
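
On the efficiency question: looking up the vectors by hand, as in the loop above, should produce the same tensors as passing word indices through a frozen embedding layer. The sketch below illustrates this under assumptions: a small random array stands in for w2v_model.wv.vectors, and the indices are made up.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for w2v_model.wv.vectors
vectors = np.random.rand(6, 4).astype(np.float32)

# One sentence, represented as row indices into the vector matrix
indices = torch.tensor([[2, 5, 1]])

# Manual lookup: one vector per word, done before the model sees the data
manual = torch.from_numpy(vectors[indices.numpy()])

# Same lookup done inside the model by a frozen embedding layer
embedding = nn.Embedding.from_pretrained(torch.from_numpy(vectors))
via_layer = embedding(indices)

same = torch.equal(manual, via_layer)
```

The practical difference is storage: with the embedding layer the dataset holds one integer per word, while precomputed inputs hold a full vector per word, which for 100-dimensional vectors is roughly a hundred times more values.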