Expected input to torch Embedding layer with pre-trained vectors from gensim

[Cross-post from Stack Overflow]

I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings are trained with gensim. I found this informative answer, which indicates that we can load pre-trained models like so:

import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)

This seems to work correctly, also on 1.0.1. My question is that I don’t quite understand what I have to feed into such a layer to make use of it. Can I just feed the tokens (segmented sentences)? Do I need a mapping, for instance token-to-index?

I found that you can access a token’s vector simply by something like

print(model['the'])
# [-1.1206588e+00  1.1578362e+00  2.8765252e-01 -1.1759659e+00 ... ]

What does that mean for an RNN architecture? Can we simply load in the tokens of the batch sequences? For instance:

for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
    output, hidden = rnn_model(seq_batch, hidden)

This does not seem to work, so I am assuming you need to convert the tokens to their indices in the final word2vec model. Is that true? I found that you can get the indices of words by using the word2vec model’s vocab:

model.vocab['world'].index
# 147

So, as an input to an Embedding layer, should I provide a tensor of ints for a batch of sentences that each consist of a sequence of word indices? An example use with a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.

In principle, there are two things you need to construct:
(1) an embedding matrix
(2) a word2index dictionary

I don’t think gensim provides a word2index dictionary directly, though I am not quite sure.
Typically, if you are training on a small corpus, you don’t have to use the whole embedding matrix, because the pretrained embedding matrix is very big and slows down your training. What you can do is extract only the part of the embeddings you need and construct the word2index dictionary at the same time.
Here are my recommended steps (a rough sketch in code follows after the list):
(1) Construct a vocabulary for your data.
(2) For each token in your vocabulary, query gensim to get its embedding vector, add it to your own embedding matrix, and add the token to your own word2index dictionary (note that the index is not the original index in gensim).
(3) Use the word2index dictionary to translate your (text) data to indices.
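Here is a minimal sketch of steps (1)–(3). The corpus below is a stand-in for your own tokenized data, the `<pad>`/`<unk>` entries and variable names are my own choices, and `model` is assumed to be the gensim KeyedVectors loaded as in your post:

import torch
from torch import nn

# (1) build a vocabulary from your own (tokenized) corpus
corpus = [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]

# (2) extract only the vectors you need; indices 0/1 are reserved for padding/unknown
word2index = {'<pad>': 0, '<unk>': 1}
vectors = [torch.zeros(model.vector_size), torch.zeros(model.vector_size)]
for sentence in corpus:
    for token in sentence:
        if token not in word2index and token in model.vocab:
            word2index[token] = len(word2index)        # our own index, not gensim's
            vectors.append(torch.FloatTensor(model[token]))

embedding_matrix = torch.stack(vectors)                # (vocab_size, emb_dim)
emb = nn.Embedding.from_pretrained(embedding_matrix)

# (3) translate text to index tensors
def encode(sentence):
    return torch.tensor([word2index.get(t, word2index['<unk>']) for t in sentence])

batch = torch.stack([encode(s) for s in corpus])       # (3, 3); all sentences happen to have the same length
embedded = emb(batch)                                  # (3, 3, emb_dim)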


Would it hurt to load a larger pretrained model which contains tokens that are not part of my actual data? Is this bad practice, in any way?

I think almost all papers use pretrained models (at least that was the case two years ago when I was into NLP :smiley:).

Would it hurt to load a larger pretrained model which contains tokens that are not part of my actual data? Is this bad practice, in any way?

  • It shouldn’t hurt. word2vec learns representations of words such that semantically similar words are mapped close to each other, so even if you only use part of the vocabulary, you just lose some memory/disk storing word2vec entries that are useless for you.
    Here is a paper that uses pretrained models for its embeddings:
    https://arxiv.org/pdf/1609.02745.pdf

I won’t say it is bad practice. As long as you get an embedding matrix and its word2index dictionary, you are good. The problem is that the gensim API doesn’t directly output these two things (I am not 100% sure). If the API has such functions, you can just call them and get the two things you really care about.
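For what it’s worth, with the older gensim versions used in this thread (pre-4.0), the `vocab` attribute already gives you enough to build the word2index dictionary yourself; this is just a sketch of that idea (in gensim 4.x the equivalent mapping is `model.key_to_index`):

import torch

# word2index pulled from gensim's own vocab (pre-4.0 API);
# the rows of model.vectors line up with these indices
word2index = {word: entry.index for word, entry in model.vocab.items()}
embedding_matrix = torch.FloatTensor(model.vectors)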

I found that you can get the index of a word in gensim’s word2vec like this:

import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
# weights to use in from_pretrained()
weights = torch.FloatTensor(model.vectors)
# getting the index of a word
model.vocab['banana'].index
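To answer the “dummy RNN” part of the original question, here is a toy sketch under the assumptions that every token actually occurs in the gensim vocabulary (otherwise you would need an `<unk>` index) and that all sentences in the batch have the same length (otherwise you need padding); the hidden size and layer choice are arbitrary:

# embed gensim indices and run an RNN over the embedded batch
emb = nn.Embedding.from_pretrained(weights)             # weights from above
rnn = nn.GRU(input_size=weights.shape[1], hidden_size=128, batch_first=True)

sentences = [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
batch = torch.tensor([[model.vocab[t].index for t in s] for s in sentences])  # (3, 3)

embedded = emb(batch)                                   # (3, 3, emb_dim)
output, hidden = rnn(embedded)                          # (3, 3, 128), (1, 3, 128)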

You are absolutely right in your first post. If you don’t experience any performance problems (like slow training speed or memory issues), you’ve got everything correct.