Hi all. I am new to PyTorch and would like your help. I have created a word2vec model of a corpus using gensim’s Word2Vec. Now I want to feed this model to a bidirectional LSTM. How do I proceed with that? And what exactly is the nn.Embedding layer? Do I need it?
You essentially always need an embedding layer to map words to vector representations: the network doesn’t understand words, only vectors (tensors) of numbers. Pretrained word embeddings such as Word2Vec or GloVe ensure that these vector representations already carry semantic meaning before the network is ever trained.
When creating an LSTM network the first layer is usually something like:
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
You can then, before training, set the weights of the embedding layer to your word vectors like this:
# The shape of embeddings_matrix must be (vocab_size, embedding_dim)
model.word_embeddings.weight.data.copy_(torch.from_numpy(embeddings_matrix))
# Make sure that the weights in the embedding layer are not updated
model.word_embeddings.weight.requires_grad = False
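To make the idea concrete, here is a minimal, self-contained sketch of the same steps on a toy matrix. The sizes and the `embeddings_matrix` values are made up for illustration; real vectors would come from Word2Vec or GloVe.

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, embedding_dim = 5, 3  # toy sizes, chosen only for illustration

# Stand-in for real pretrained vectors (e.g. Word2Vec or GloVe).
embeddings_matrix = np.arange(vocab_size * embedding_dim, dtype=np.float32)
embeddings_matrix = embeddings_matrix.reshape(vocab_size, embedding_dim)

word_embeddings = nn.Embedding(vocab_size, embedding_dim)

# Overwrite the randomly initialised weights with the pretrained ones.
with torch.no_grad():
    word_embeddings.weight.copy_(torch.from_numpy(embeddings_matrix))

# Freeze the layer so that training does not update the vectors.
word_embeddings.weight.requires_grad = False

# The layer acts as a lookup table: index i returns row i of the matrix.
indices = torch.tensor([[0, 2, 4]])   # one "sentence" of three word indices
looked_up = word_embeddings(indices)  # shape: (1, 3, embedding_dim)
```

So the embedding layer is nothing more than a table indexed by word id; the pretrained vectors only decide what the rows contain.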
The main task is to create embeddings_matrix correctly. In my code, I use GloVe word vectors that I load from a file. The following snippet just shows the rough idea, since all of this is custom code.
word_vector_loader = WordVectorLoader()
# word_vector_loader.load_glove('glove.6B.100d.txt')  # embedding_dim=100
embeddings_matrix = word_vector_loader.generate_embedding_matrix(vectorizer.vocabulary.word_to_index, 100, max_idx)
There is also torchtext, which might make all of this much easier. At least it provides you with pretrained word vectors. If you insist on using your own, you will probably have to prepare them yourself to serve as weights for the embedding layer.
It’s a bit difficult to be more helpful without seeing any code; I just copied the snippets from one of my projects.
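Since the loader above is custom code, here is a rough sketch of what building such a matrix could look like. The function name `build_embedding_matrix` and the plain dict `word_vectors` are hypothetical stand-ins and not the code from my project; the only requirement is that row i holds the vector of the word mapped to index i.

```python
import numpy as np

def build_embedding_matrix(word_to_index, word_vectors, embedding_dim):
    """Return a (max_index + 1, embedding_dim) matrix whose row i is the
    vector of the word that word_to_index maps to i."""
    num_rows = max(word_to_index.values()) + 1
    matrix = np.zeros((num_rows, embedding_dim), dtype=np.float32)
    for word, idx in word_to_index.items():
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
        # Words without a pretrained vector (e.g. padding) stay all zeros.
    return matrix

# Hypothetical toy vocabulary and vectors, for illustration only.
word_to_index = {"<pad>": 0, "i": 1, "go": 2, "work": 3}
word_vectors = {
    "i": np.array([0.1, 0.2], dtype=np.float32),
    "go": np.array([0.3, 0.4], dtype=np.float32),
}
embeddings_matrix = build_embedding_matrix(word_to_index, word_vectors, embedding_dim=2)
```

Leaving unknown words as zero rows is just one option; random initialisation for those rows is another common choice.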
Thank you, Chris, for the help, especially the requires_grad part.
So I have made some changes and these are the steps I followed:
model.save('w2v.model') # which persists the word2vec model I created using gensim
model = Word2Vec.load('w2v.model') # loading the model
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
Do these steps seem correct (I haven’t added the requires_grad part yet)? The w2v dimension is 200.
I cannot test the code but it looks alright.
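One detail on requires_grad: nn.Embedding.from_pretrained takes a freeze argument that defaults to True, so the weights are already excluded from training unless you ask otherwise. A small sketch, with a random array standing in for model.wv.vectors (an assumption, so that it runs without gensim):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for model.wv.vectors: 6 words, 200 dimensions, random values.
vectors = np.random.rand(6, 200).astype(np.float32)
weights = torch.from_numpy(vectors)

frozen = nn.Embedding.from_pretrained(weights)                   # freeze=True is the default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)  # allow fine-tuning

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```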
You only have to make sure that the input sequences match the embedding. For example, if you convert the sentence “i go to work every day” into the input sequence [4, 24, 8, 120, 53, 78, 0, 0, 0, 0] with 0 representing padding, so that 4 represents “i”, 24 represents “go” and so on…then that must match the embedding matrix, i.e., the row at index 4 of the embedding matrix must be the word vector for “i”. See this other post discussing that issue.
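Here is a sketch of one way to keep indices and rows aligned when 0 is reserved for padding. The list `index_to_key` and the array `vectors` are toy stand-ins for a gensim vocabulary and its vectors (in gensim 4.x these would be model.wv.index_to_key and model.wv.vectors); the helper `encode` is hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for a gensim vocabulary: row i of `vectors` belongs to index_to_key[i].
index_to_key = ["i", "go", "to", "work", "every", "day"]
vectors = np.random.rand(len(index_to_key), 8).astype(np.float32)

# Reserve index 0 for padding: prepend a zero row and shift every word index by one.
pad_row = np.zeros((1, vectors.shape[1]), dtype=np.float32)
weights = torch.from_numpy(np.vstack([pad_row, vectors]))
word_to_index = {word: i + 1 for i, word in enumerate(index_to_key)}

embedding = nn.Embedding.from_pretrained(weights, padding_idx=0)

def encode(sentence, max_len):
    """Map words to indices and right-pad with 0 up to max_len.
    Unknown words raise a KeyError in this sketch."""
    ids = [word_to_index[w] for w in sentence.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

ids = encode("i go to work every day", max_len=10)
embedded = embedding(torch.tensor([ids]))  # shape: (1, 10, 8)
```

Without the extra zero row, index 0 would point at a real word, and padded positions would silently receive that word’s vector.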
I followed your instructions but ran into a problem, and I haven’t found a solution. I hope you can help me.
Basically, my code looks like this:
w2v_model = gensim.models.Word2Vec.load('./model/word2vec.model')
w2v_weights = w2v_model.wv.vectors
So the weights have shape (81505, 100).
I then defined my model like this:
class LSTM(nn.Module):
    """docstring for LSTM"""
    def __init__(self):
        super(LSTM, self).__init__()
        self.word_embeddings = nn.Embedding(81505, 100)
        self.lstm = nn.LSTM(500, 300, 2, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.dense = nn.Linear(200, 5)
        self.act = nn.ReLU()

    def forward(self, x):
        text_emb = self.word_embeddings(x)
        lstm_out, lstm_hidden = self.lstm(text_emb)
        lstm_out = lstm_out[:, -1, :]
        lstm_out = self.act(lstm_out)
        drop_out = self.dropout(lstm_out)
        output = self.dense(drop_out)
        return output
But after starting the training I got this error:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Sorry if my question is too obvious. I’m really new to NLP.
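For reference, this message typically appears when an input index is negative or not smaller than the first argument of nn.Embedding. Whether that is the cause here cannot be told from the code shown, so treat the following as a sketch for checking it; the helper `indices_in_range` is hypothetical.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # valid indices: 0..9

def indices_in_range(batch, layer):
    """True if every index in `batch` can be looked up in `layer`."""
    return bool(batch.min() >= 0 and batch.max() < layer.num_embeddings)

good = torch.tensor([[1, 5, 9]])
bad = torch.tensor([[1, 5, 10]])  # 10 is one past the last valid row

print(indices_in_range(good, embedding))  # True
print(indices_in_range(bad, embedding))   # False

try:
    embedding(bad)
    raised = False
except IndexError:
    raised = True  # on CPU this is reported as "index out of range in self"
```

Running such a check on a batch before the forward pass shows quickly whether the tokenizer’s indices and the embedding size disagree.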
Hm, I’m not sure how to interpret the error. It seems to come from a torch package? If your aim is to get your code running, I would try the following to simplify the model, which makes debugging easier:
Name your class MyLSTM. It’s probably not an issue, but there’s no need to risk naming problems :).
Use just one layer and make it unidirectional: self.lstm = nn.LSTM(500, 300, 1, bidirectional=False). Otherwise lstm_out = lstm_out[:,-1,:] might yield unexpected results.
Remove (comment out) the activation and the dropout. nn.LSTM already uses activation functions, and Dropout is not needed to get it working, “just” to improve the results.
Since you don’t show the code, I’m not sure if you set the pretrained word vectors. Anyway, again, they are not needed to get a working model for a start.
The hidden size of your nn.LSTM is 300, but you define nn.Linear(200, 5). How do you get from 300 dimensions to 200 for the linear layer?
Can you change your forward method as follows, so that it prints the tensor shapes?
def forward(self, x):
    print(x.shape)
    text_emb = self.word_embeddings(x)
    print(text_emb.shape)
    lstm_out, lstm_hidden = self.lstm(text_emb)
    lstm_out = lstm_out[:, -1, :]
    print(lstm_out.shape)
    output = self.dense(lstm_out)  # dropout is removed, so feed lstm_out directly
    return output
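For comparison, here is a sketch of a version in which all dimensions line up, assuming the inputs are index tensors of shape (batch, seq_len). It is not a drop-in fix for your code: the constructor arguments are made up, and taking the final hidden states instead of lstm_out[:,-1,:] is a design choice of this sketch. Note that nn.LSTM expects (seq_len, batch, features) unless batch_first=True is set.

```python
import torch
import torch.nn as nn

class MyLSTM(nn.Module):
    def __init__(self, vocab_size=81505, embedding_dim=100, hidden_dim=300, num_classes=5):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # input_size must equal embedding_dim; batch_first=True means (batch, seq, feature).
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        # A bidirectional LSTM yields 2 * hidden_dim features per position.
        self.dense = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        text_emb = self.word_embeddings(x)           # (batch, seq_len, embedding_dim)
        _, (h_n, _) = self.lstm(text_emb)            # h_n: (num_layers * 2, batch, hidden_dim)
        # Final forward and backward hidden states of the last layer.
        last = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2 * hidden_dim)
        return self.dense(last)                      # (batch, num_classes)

model = MyLSTM(vocab_size=50)          # small vocabulary so the sketch runs quickly
batch = torch.randint(0, 50, (4, 12))  # 4 sequences of 12 word indices
logits = model(batch)
```

With a bidirectional LSTM, the backward direction has seen only one token at the last time step, which is one reason slicing lstm_out at -1 can give unexpected results.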
Thank you for your answer.
I think I may have already used the word2vec weights to convert my words before feeding them to the LSTM. So I believe my input is already embedded, and I don’t need the embedding layer anymore? I’m not sure whether that is correct.
Here is my code to convert the data before passing it to the model:
for sentence in txt_sequence:
    temp = []
    for word in sentence:
        temp.append(w2v_model.wv[word])  # extract the word's vector from the w2v weights
After that, I zero-pad my x using the padding utility provided by Keras. So I removed the embedding layers and everything works. But I’m not sure whether this is the most efficient way.
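As far as I can tell, both routes give the LSTM the same numbers as long as the vectors stay frozen. The practical difference is that index sequences store one integer per token instead of 100 floats, and an embedding layer can later be unfrozen for fine-tuning. A small sketch of the equivalence, with random vectors standing in for the word2vec weights and row 0 reserved for padding (both assumptions of this example):

```python
import numpy as np
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Stand-in for the word2vec weights; row 0 is reserved for padding here.
vectors = np.random.rand(7, 100).astype(np.float32)
vectors[0] = 0.0
embedding = nn.Embedding.from_pretrained(torch.from_numpy(vectors), padding_idx=0)

# Variable-length sentences kept as small integer tensors.
sentences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded_ids = pad_sequence(sentences, batch_first=True, padding_value=0)  # (2, 3)

via_layer = embedding(padded_ids)                            # (2, 3, 100)
via_manual = torch.from_numpy(vectors[padded_ids.numpy()])   # same lookup done by hand
```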