Hi all. I am new to PyTorch and wanted your help. I have created a word2vec model of a corpus using gensim's w2v function. Now I want to feed this model to a bidirectional LSTM. How do I proceed with that? And what exactly is the nn.Embedding layer? Do I need it?
You essentially always need an embedding layer to map words to vector representations - the network doesn't understand words, only vectors/tensors of numbers. Word embeddings such as Word2Vec or GloVe ensure that these vector representations already carry semantic meaning before you ever train the network.
When creating an LSTM network the first layer is usually something like:
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
You can then, before training, set the weights of the embedding layer to your word vectors like this:
# The shape of embeddings_matrix must be (vocab_size, embedding_dim)
model.word_embeddings.weight.data.copy_(torch.from_numpy(embeddings_matrix))
# Make sure that the weights in the embedding layer are not updated
model.word_embeddings.weight.requires_grad=False
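Alternatively, nn.Embedding.from_pretrained does both steps in one call (a small sketch, assuming embeddings_matrix is a NumPy array of shape (vocab_size, embedding_dim)):

import torch
import torch.nn as nn

# freeze=True is the default and corresponds to requires_grad=False
embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(embeddings_matrix).float(), freeze=True
)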
The main task is to create the embeddings_matrix correctly. In my code, I use GloVe word vectors that I load from a file. The following snippet just shows the rough idea, since all this stuff is custom code.
word_vector_loader = WordVectorLoader()
word_vector_loader.load_glove('glove.6B.100d.txt') # embedding_dim=100
embeddings_matrix = word_vector_loader.generate_embedding_matrix(vectorizer.vocabulary.word_to_index, 100, max_idx)
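In case it helps, here is a minimal sketch of what such a generate_embedding_matrix step might look like with gensim KeyedVectors instead of my custom loader (word_to_index and embedding_dim are placeholders for your own vocabulary mapping; this is not the exact code I use):

import numpy as np

def build_embedding_matrix(word_to_index, keyed_vectors, embedding_dim):
    # One row per vocabulary index; words without a pretrained vector
    # (e.g. padding or rare tokens) keep an all-zero row.
    matrix = np.zeros((len(word_to_index), embedding_dim), dtype=np.float32)
    for word, idx in word_to_index.items():
        if word in keyed_vectors:
            matrix[idx] = keyed_vectors[word]
    return matrix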
Check out torchtext
which might make all of this much easier. At least it provides you with pretrained word vectors. If you insist on using your own, you probably have to prepare them to serve as weights for the embedding layer yourself.
It's a bit difficult to be more helpful without seeing any code; I just copied the snippets from one of my projects.
Thank you Chris for the help especially the requires_grad part.
So I have made some changes and these are the steps I followed:
1: model.save('w2v.model')
# which persists the word2vec model I created using gensim
2: model = Word2Vec.load('w2v.model')
# loading the model
3:
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
Do these steps seem correct (I haven't added the requires_grad part yet)? The w2v dimension is 200.
I cannot test the code but it looks alright.
You only have to make sure that the input sequences match the embedding. For example, if you convert the sentence "i go to work every day" into the input sequence [4, 24, 8, 120, 53, 78, 0, 0, 0, 0]
with 0 representing padding, so that 4 represents "i", 24 represents "go", and so on, then that must match in the embedding matrix, i.e., the word vector at index 4 is the one for "i". See this other post discussing that issue.
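With gensim, one way to guarantee this alignment is to build the input sequences from the model's own vocabulary mapping, so the indices line up with the rows of model.wv.vectors. A rough sketch (key_to_index assumes gensim 4.x; a pad_id of 0 is only safe if that row is actually reserved for padding):

def encode(sentence, wv, max_len, pad_id=0):
    # Look up each token's row index in the word2vec vocabulary,
    # skip out-of-vocabulary tokens, and pad/truncate to max_len.
    ids = [wv.key_to_index[w] for w in sentence.split() if w in wv.key_to_index]
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))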
Hi Chris,
I followed your instructions but I ran into a problem… I didn't find any solution. I hope that you can help me.
So basically my code looks like this:
w2v_model = gensim.models.Word2Vec.load('./model/word2vec.model')
w2v_weights = w2v_model.wv.vectors
so the weights have shape (81505, 100)
so I define my model like this:
class LSTM(nn.Module):
    """docstring for LSTM"""
    def __init__(self):
        super(LSTM, self).__init__()
        self.word_embeddings = nn.Embedding(81505, 100)
        self.lstm = nn.LSTM(500, 300, 2, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.dense = nn.Linear(200, 5)
        self.act = nn.ReLU()

    def forward(self, x):
        text_emb = self.word_embeddings(x)
        lstm_out, lstm_hidden = self.lstm(text_emb)
        lstm_out = lstm_out[:, -1, :]
        lstm_out = self.act(lstm_out)
        drop_out = self.dropout(lstm_out)
        output = self.dense(drop_out)
        return output
But then after starting the training I got this error:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Sorry if my question is too obvious. I'm really new to NLP.
Hm, I'm not sure how to interpret the error. It seems to come from a torch package? If your aim is to get your code running, I would try the following to simplify the model, which makes debugging easier:
- Name your class MyLSTM. It's probably not an issue, but there's no need to risk naming problems :)
- Use just one layer and unidirectional: self.lstm = nn.LSTM(500, 300, 1, bidirectional=False). Otherwise lstm_out = lstm_out[:,-1,:] might yield unexpected results.
- Remove (comment out) self.dropout and self.act. nn.LSTM already uses activation functions, and dropout is not needed to get it working, "just" to improve the results.
- Since you don't show the code, I'm not sure if you set the pretrained word vectors. Anyway, again, they are not needed to get a working model for a start.
- The hidden_dim of your nn.LSTM is 300, but you define nn.Linear(200, 5). How do you get from 300 dimensions to 200 for the linear layer?
- Can you change your forward() method to

def forward(self, x):
    print(x.shape)
    text_emb = self.word_embeddings(x)
    print(text_emb.shape)
    lstm_out, lstm_hidden = self.lstm(text_emb)
    lstm_out = lstm_out[:, -1, :]
    print(lstm_out.shape)
    output = self.dense(lstm_out)
    return output
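One more quick sanity check you could add (a sketch; x here is assumed to be your padded batch of word indices): this IndexError typically means that some index falls outside the embedding's vocabulary range.

# nn.Embedding(81505, 100) only accepts indices in [0, 81504];
# anything outside that range raises "index out of range in self".
print(x.min(), x.max())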
Thank you for your answer.
I think I did use the word2vec weights to convert my words before feeding them to the LSTM, so I believe my input is already embedded? In that case I don't need the embedding layer anymore. I'm not so sure if that is correct.
Here is my code to convert the data before feeding it to the model:
for sentence in txt_sequence:
    temp = []
    for word in sentence:
        temp.append(w2v_model.wv[word])  # this extracts the word's vector from the w2v weights
    x.append(temp)
After that, I pad my x with the zero-padding provided by Keras. So I removed the embedding layer and everything works. But I'm not sure if it is the most efficient way.
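For anyone feeding pre-computed vectors directly like this, a minimal sketch of the expected shapes (assumed sizes: batch of 8, sequence length 50, embedding_dim 100, batch_first=True); the LSTM's input_size must equal the word2vec dimension:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=300, num_layers=1,
               batch_first=True, bidirectional=False)

x = torch.randn(8, 50, 100)    # (batch, seq_len, embedding_dim) of pre-computed w2v vectors
out, (h_n, c_n) = lstm(x)      # out has shape (8, 50, 300)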