Word2vec output with lstm sequence generator

boris.mtdv · October 1, 2019, 12:52pm

I’m trying to train an LSTM using pre-trained word2vec vectors as input. The LSTM is meant to generate a sequence given the first vector. I have been able to do this by passing the hidden state to a fully connected layer whose out_features equals the size of my vocabulary, so that the LSTM works as a classifier. What I want, however, is for the LSTM to output a vector with the same dimensions as the pretrained input vectors, so that I can then call similar_by_vector and get the nearest word in the vocabulary. I don’t know whether it is advisable to use nn.Linear as the fully connected layer in this case. It currently looks like this:

nn.Linear(in_features=self.cfg.lstm.lstm_num_hidden, out_features=self.cfg.lstm.embedding_dim, bias=True)

Is this enough to produce a word2vec vector as output (or at least something close to it)?

Any advice would be much appreciated.
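For reference, the setup described in this post could be sketched roughly as follows. The class name, the sizes, and the random input tensor are assumptions made for illustration, not the poster’s actual code. The linear layer guarantees that the output has the right shape; whether it ends up close to a word2vec vector depends on the loss used during training.

```python
import torch
import torch.nn as nn

class Seq2VecLSTM(nn.Module):
    """LSTM whose per-step output is projected to the embedding dimension."""

    def __init__(self, embedding_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Maps each hidden state to a vector with the same size as a word2vec vector.
        self.fc = nn.Linear(hidden_dim, embedding_dim, bias=True)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)   # out: (batch, seq_len, hidden_dim)
        return self.fc(out), state         # (batch, seq_len, embedding_dim)

torch.manual_seed(0)
model = Seq2VecLSTM(embedding_dim=100, hidden_dim=256)  # sizes are hypothetical
inputs = torch.randn(4, 7, 100)  # random stand-in for a batch of word2vec sequences
pred, _ = model(inputs)
```

Here pred has one embedding-sized vector per input step, which could then be compared against the pretrained vectors.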

akskuchi · October 1, 2019, 12:59pm

Why not make the LSTM hidden dimension the same as the embedding dimension (your LSTM input dimension)?

boris.mtdv · October 2, 2019, 8:13am

Would that work? I thought the hidden layer’s output had to be put through an additional transformation before it can be considered the “real” output (just an assumption based on what I’ve read). Do you know why so many examples put the hidden layer’s output through nn.Linear()? I haven’t been able to figure out what that layer does, except for resizing the output’s dimensions.

akskuchi · October 2, 2019, 8:43am

I don’t think the hidden layer’s output has to be put through an additional transformation. Implementations for machine translation, image captioning, and similar tasks usually learn a linear transformation as the second-to-last step (just before the softmax) that maps the hidden size to the vocabulary size, so that the most probable word can be chosen.

However, if your objective is to use similar_by_vector for finding the nearest word representation in the vocabulary, I feel you wouldn’t need any additional layers or SoftMax on top of the standard LSTM network.
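A minimal sketch of that idea, assuming a toy vocabulary and random vectors in place of real word2vec embeddings. The helper nearest_word is a hypothetical stand-in for gensim’s similar_by_vector, written in plain PyTorch so that the example has no extra dependencies. The LSTM here is untrained, so the generated words are meaningless; the point is only the data flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]        # hypothetical toy vocabulary
embedding_dim = 8
vectors = torch.randn(len(vocab), embedding_dim)  # stand-in for pretrained word2vec vectors

# Hidden size equals the embedding size, so each LSTM output is already a candidate word vector.
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=embedding_dim, batch_first=True)

def nearest_word(vec):
    """Return the index of the vocabulary vector with the highest cosine similarity to vec."""
    sims = F.cosine_similarity(vec.unsqueeze(0), vectors, dim=1)
    return int(sims.argmax())

# Generate a short sequence from the first word, feeding each chosen word vector back in.
idx = vocab.index("the")
state = None
generated = [vocab[idx]]
with torch.no_grad():
    for _ in range(4):
        step_input = vectors[idx].view(1, 1, -1)  # (batch=1, seq_len=1, embedding_dim)
        out, state = lstm(step_input, state)
        idx = nearest_word(out[0, 0])
        generated.append(vocab[idx])
```

Snapping each output to the nearest vocabulary vector before feeding it back is one possible design choice; feeding the raw output vector back in would be another.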

boris.mtdv · October 2, 2019, 10:15am

Doesn’t the network need to use a fully-connected layer in order for the weights to be updated properly? How would the network be able to update its weights if there is no output to evaluate and give to the loss function? If I understand it correctly, you’re suggesting using the hidden layer’s output to represent the embeddings being provided as input, but it seems the two types of vectors would be generated in a very different way. How would the network “know” to associate one with the other?

akskuchi · October 2, 2019, 10:52am

Doesn’t the network need to use a fully-connected layer in order for the weights to be updated properly?

No. I’m not sure what you mean by “properly”.

How would the network be able to update its weights if there is no output to evaluate and give to the loss function?

If you mean the LSTM weights, it depends on what you do with the outputs of the LSTM in the first place. Consider a setting with no additional fully connected layers. In that case, you can take the outputs of the LSTM (the hidden representations) and compare them against the word2vec representations of some target words, using a distance metric such as cosine similarity (available in PyTorch). You can then call .backward() on the distances, and the network’s weights will update without a problem.
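As a rough sketch of this, assuming random tensors in place of real word2vec sequences and an arbitrary choice of optimizer and learning rate, the loss below is one minus the cosine similarity between each LSTM output and the next vector in the sequence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding_dim = 16  # hypothetical size; the hidden size is set equal to it
lstm = nn.LSTM(embedding_dim, embedding_dim, batch_first=True)
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-2)

# Random stand-in for two sequences of six word2vec vectors each.
seq = torch.randn(2, 6, embedding_dim)
inputs, targets = seq[:, :-1], seq[:, 1:]  # predict the next vector at every step

losses = []
for _ in range(50):
    optimizer.zero_grad()
    out, _ = lstm(inputs)  # (batch, seq_len, embedding_dim)
    # Cosine distance between each output and its target, averaged over all steps.
    loss = (1.0 - F.cosine_similarity(out, targets, dim=-1)).mean()
    loss.backward()        # gradients flow straight into the LSTM weights
    optimizer.step()
    losses.append(loss.item())
```

No fully connected layer is involved: the loss is computed directly on the hidden representations, and every LSTM parameter still receives a gradient.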

How would the network “know” to associate one with the other?

It depends on what you want the network to associate. If you are doing supervised learning, you presumably have some form of ground truth at hand, which you can use to compute the loss and propagate it back.

boris.mtdv · October 2, 2019, 1:13pm

In that case would I have to do anything other than setting the dimensions of the hidden layer to be the same as those of my word embeddings? For example, would I have to make sure that the vectors are bounded in the same way?

akskuchi · October 2, 2019, 2:07pm

As far as I know, no. Consider the pre-trained word2vec representations that you use as input embeddings: they are not bounded to a range.

Nevertheless, you could enforce such a range (like -1 to 1) on both the input and hidden representations by making use of a tanh non-linearity at both ends.
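One detail may be worth keeping in mind here: in a standard LSTM the hidden state is computed as the output gate (a sigmoid) times tanh of the cell state, so its components already lie between -1 and 1. The sketch below, which uses random tensors as stand-ins for word2vec vectors, applies tanh on the input side and checks the range of the output. It also illustrates that cosine similarity ignores vector magnitude, which is why the bounding matters less for a cosine-based loss than it would for, say, a mean squared error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding_dim = 16  # hypothetical size
lstm = nn.LSTM(embedding_dim, embedding_dim, batch_first=True)

# Stand-in vectors with many components outside [-1, 1].
raw = 3.0 * torch.randn(2, 5, embedding_dim)
squashed = torch.tanh(raw)  # input side squashed into (-1, 1)

out, _ = lstm(squashed)
max_abs_output = out.abs().max().item()  # hidden states stay within [-1, 1]

# Cosine similarity depends only on direction, not on length.
a, b = torch.randn(embedding_dim), torch.randn(embedding_dim)
sim = F.cosine_similarity(a, b, dim=0)
sim_scaled = F.cosine_similarity(10.0 * a, b, dim=0)
```

If a linear layer is placed after the LSTM instead, its output is unbounded, and an explicit tanh would be needed to enforce the range.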