Aligning torchtext vocab index to loaded embedding pre-trained weights

I have a word2vec model whose pretrained weights I loaded into an embedding layer.

However, I’m currently stuck trying to align the indexes of the torchtext vocab with the indexes of my pretrained weights.

Loaded the pretrained vectors successfully.

import gensim
import torch
import torch.nn as nn
model = gensim.models.Word2Vec.load('path to word2vec model')
word_vecs = torch.FloatTensor(model.wv.vectors)  # called model.wv.syn0 in older gensim
embedding = nn.Embedding.from_pretrained(word_vecs)

However, I’m stuck on how to make torchtext’s build_vocab produce the same indexes as my word2vec model.

i.e. if I do text.build_vocab(training_data)

I could get an stoi like the following:

<unk> : 0
<pad> : 1
hello: 2
world: 3
bye: 4

The problem is that my word2vec embedding assigns different indexes to those strings, so each index points to the weights of a different word.

i.e. in my word2vec index, assuming the vectors have 2 dimensions:

good: 0 => [0.34, 0.56]
bye: 1 => [0.34, 0.47]
day: 2 => [0.98, 0.67]
morning: 3 => [0.43, 0.67]
all: 4 => [0.96, 0.76]
hello: 68 => [0.12, 0.34]
world: 50 => [0.28, 0.96]

So when torchtext converts the tokens to indexes, those indexes do not align with the indexes of the word2vec model, and the wrong embeddings are assigned.


sample input = "hello world bye"
torchtext output index => [2,3,4]
embedding output => [[0.98, 0.67], [0.43, 0.67], [0.96, 0.76]]

but it should be:
torchtext output index => [68,50,1]
embedding output => [[0.12, 0.34],[0.28, 0.96], [0.34, 0.47]]

I would be grateful for any solutions or suggestions to get this to work properly. I wanted to avoid doing the word-to-index conversion myself and instead leverage torchtext’s build_vocab, because it takes care of the padding and unknown tokens along with many other conveniences.



To load pretrained embedding vectors generated by gensim into torchtext, you need to:

  1. Save the embedding vectors in “word2vec” format,
model = gensim.models.Word2Vec(...)
model.wv.save_word2vec_format(path_to_embeddings_file)
  2. Then load the “word2vec” format file with “Vectors” from torchtext,
from torchtext.vocab import Vectors
vectors = Vectors(name=model_name, cache=path) # model_name + path = path_to_embeddings_file
  3. You can either provide the embedding vectors when you call the build_vocab function or set them later,
# provide the embedding vectors when you call the build_vocab function
TEXT = data.Field()
TEXT.build_vocab(training_data, vectors=vectors)
# or set the embedding vectors later
TEXT.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
  4. Initialise the nn.Embedding variable with the embedding vectors
embedding = nn.Embedding(n_embed, embed_dim).from_pretrained(TEXT.vocab.vectors)

That’s it. TEXT.vocab.vectors is now ordered by the torchtext vocab index, so each index looks up the pretrained vector of the right word.
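To make the last step concrete, here is a minimal pure-Python sketch of the re-indexing that handing the vectors to the vocab is assumed to perform: one row is built per vocab token, in vocab order, and tokens missing from the pretrained set fall back to zeros. The helper name align_vectors is illustrative and not torchtext API, and the toy numbers are the ones from the question.

```python
def align_vectors(itos, pretrained_stoi, pretrained_rows, dim):
    """Return one row per vocab token, ordered by vocab index."""
    aligned = []
    for token in itos:
        src = pretrained_stoi.get(token)
        if src is None:
            # Token has no pretrained vector (e.g. <unk>, <pad>): use zeros.
            aligned.append([0.0] * dim)
        else:
            # Copy the pretrained row that belongs to this token.
            aligned.append(list(pretrained_rows[src]))
    return aligned


# torchtext-style vocab from the question
itos = ["<unk>", "<pad>", "hello", "world", "bye"]

# word2vec-style index from the question; the dict of rows is a sparse
# stand-in for the real weight matrix
pretrained_stoi = {"good": 0, "bye": 1, "day": 2, "morning": 3, "all": 4,
                   "hello": 68, "world": 50}
pretrained_rows = {0: [0.34, 0.56], 1: [0.34, 0.47], 2: [0.98, 0.67],
                   3: [0.43, 0.67], 4: [0.96, 0.76],
                   68: [0.12, 0.34], 50: [0.28, 0.96]}

aligned = align_vectors(itos, pretrained_stoi, pretrained_rows, dim=2)

# "hello world bye" -> torchtext indexes [2, 3, 4]
output = [aligned[i] for i in [2, 3, 4]]
print(output)  # [[0.12, 0.34], [0.28, 0.96], [0.34, 0.47]]
```

So the indexes stay those of the torchtext vocab; it is the rows of the weight matrix that get reordered to match them.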


Is there a way to load a pre-trained binary word2vec file from gensim?

Loading vectors from the original text-based word2vec format can sometimes cause issues, probably because of how the text was tokenized during word2vec training (for example, multi-word tokens) or because of character encoding issues.
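As a hedged illustration of the multi-word-token case: the text format puts a token and its values on one space-separated line, so a parser that splits on spaces sees one extra "dimension" when the token itself contains a space. The helper parse_line below is hypothetical and only assumes that kind of naive splitting; it is not the torchtext parser.

```python
def parse_line(line):
    """Split one text-format line into (token, values) the naive way."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], parts[1:]


# A single-word token parses into the expected 2 values.
ok_token, ok_values = parse_line("hello 0.12 0.34\n")

# A token containing a space spills its second word into the values.
bad_token, bad_values = parse_line("new york 0.28 0.96\n")

print(len(ok_values))   # 2
print(len(bad_values))  # 3, because "york" is counted as a vector entry
```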

For example, with my word2vec model, doing what @Boyu_Zhang suggested, vectors = Vectors(name=model_name, cache=path), results in the following error:

RuntimeError: Vector for token b'\xc2\xa0' has 129 dimensions, but previously read vectors have 128 dimensions. All vectors must have the same number of dimensions.

But I can load the binary file for the vectors using gensim without a problem, so if someone can load the vectors/weights saved in binary format using gensim and somehow load them to Field.vocab that problem would be solved.
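On reading the binary file: the following is only a sketch, assuming the layout used by the original word2vec tool (an ASCII header line with vocabulary size and dimension, then for each token its UTF-8 bytes, a space, and that many little-endian float32 values, optionally followed by a newline). read_word2vec_binary is a hypothetical helper, and the sample bytes are made up for the demonstration.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse a word2vec binary stream into (stoi, rows, dim)."""
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    stoi, rows = {}, []
    for index in range(vocab_size):
        # Token bytes run up to the first space; a newline left over from
        # the previous record is skipped.
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a token")
            if ch == b" ":
                break
            if ch != b"\n":
                chars.append(ch)
        token = b"".join(chars).decode("utf-8", errors="replace")
        # The vector has a fixed byte length, so odd tokens cannot shift it.
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of data while reading a vector")
        rows.append(list(struct.unpack(f"<{dim}f", raw)))
        stoi[token] = index
    return stoi, rows, dim


# Two made-up records, one of them the non-breaking space from the error above.
sample = (
    b"2 2\n"
    + b"hello " + struct.pack("<2f", 0.5, 0.25) + b"\n"
    + "\u00a0".encode("utf-8") + b" " + struct.pack("<2f", 1.0, -1.0) + b"\n"
)
stoi, rows, dim = read_word2vec_binary(io.BytesIO(sample))
print(stoi)  # {'hello': 0, '\xa0': 1}
print(rows)  # [[0.5, 0.25], [1.0, -1.0]]
```

In practice gensim's KeyedVectors.load_word2vec_format(path, binary=True) does this parsing. Either way, a stoi mapping plus a tensor of rows is the shape of input that TEXT.vocab.set_vectors(stoi, vectors, dim) takes in the reply above.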


This is awesome. Thank you.

Is there a way to decouple this
embedding = nn.Embedding(n_embed, embed_dim).from_pretrained(TEXT.vocab.vectors)
from TEXT? This essentially means your embedding layer is a function of your training data. I want to treat my embedding layer as fixed.

Basically, I want to be able to put
embedding = nn.Embedding(n_embed, embed_dim).from_pretrained(..)
in a nn.Module class that loads my pretrained embeddings.

Here’s what I’ve tried:

dummy_text = data.Field()
dummy_text.vocab.set_vectors(stoi=vectors.stoi, vectors=vectors.vectors, dim=vectors.dim)
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(dummy_text.vocab.vectors))

But dummy_text is not being updated. The output of dummy_text.vocab.itos is ['<unk>', '<pad>']. So obviously, when I run anything through my embedding layer, I’m getting index out-of-bounds errors.

What’s the correct design pattern here? Do I just keep the embedding layer outside my nn.Module class? What would I do at inference time if the embedding layer is outside my nn.Module class? Do I need to use the training data to reconstruct my embedding layer to make predictions on new, unseen data points? Surely, there is a better way.

Thanks for your help.

You have to pass the first argument (the dataset) to the build_vocab() function.

I’ve got a solution to my issue above. To recap, I want to be able to load up a set of pre-trained embeddings and use all of them for my embedding layer. In other words, my embedding layer has nothing to do with my training dataset. Here’s how I did it.

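import os
from collections import Counter

import torch
import torch.nn as nn
import torchtext
from torchtext import data  # torchtext.legacy.data in torchtext 0.9 to 0.11

# load_json and load_pickle are my own small helper functions (not shown)

# fasttext model in w2v format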
file_path = '../model/embeds/fasttext_w2vformat_20200508_122722.txt'

# load pre-trained vectors into torch
cache, name = os.path.split(file_path)
# name is the name of the file
# cache is the directory that it lives in (also happens to be where a new file with a .pt extension will be written and cached)
# since the w2v file format sorts word vectors in descending order by their frequency, 
# we can safely use max_vectors to clip the number of vectors used
vectors = torchtext.vocab.Vectors(name=name, cache=cache, max_vectors=50000)

# make sure to save word frequencies when you train your fasttext model
word_counter = Counter(load_json(word_freq_file_path))

# NB: you can pickle this vocab object and use it later too!
vocab = torchtext.vocab.Vocab(word_counter, vectors=vectors)

TEXT = data.Field(batch_first=True)
LABEL = data.Field(sequential=False, unk_token=None)

# now here's where the magic happens (or hack depending on how you look at it)
# the beauty of this is that you don't have to run build_vocab
TEXT.vocab = vocab

# further you can now make use of your pretrained embeddings inside your model by passing them in
# as a parameter or loading the handy pickle file mentioned above - a much better design pattern
# indeed than relying on TEXT.build_vocab() in order to define the embedding layer of your model
embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(load_pickle(vocab_file_path).vectors))
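For readers wondering how the vocab indexes come out of the Counter, here is a rough pure-Python sketch of the ordering the legacy torchtext.vocab.Vocab is assumed to apply: specials first, then tokens by descending frequency, with ties broken alphabetically. build_itos is a hypothetical stand-in, not the library function, and the counts are made up.

```python
from collections import Counter


def build_itos(counter, specials=("<unk>", "<pad>"), min_freq=1):
    """Order tokens the way a frequency-sorted vocab would index them."""
    itos = list(specials)
    # Sort alphabetically first, then stably by descending frequency,
    # so equally frequent tokens stay in alphabetical order.
    by_alpha = sorted(counter.items(), key=lambda item: item[0])
    by_freq = sorted(by_alpha, key=lambda item: item[1], reverse=True)
    for word, freq in by_freq:
        if freq < min_freq:
            break  # everything after this point is rarer still
        if word not in specials:
            itos.append(word)
    return itos


word_counter = Counter({"hello": 5, "world": 5, "bye": 2})
itos = build_itos(word_counter)
stoi = {word: index for index, word in enumerate(itos)}
print(itos)  # ['<unk>', '<pad>', 'hello', 'world', 'bye']
```

Under this assumption, indexes 0 and 1 always belong to the special tokens, whatever rows 0 and 1 hold in the word2vec file. The pretrained vectors are matched to vocab entries by token string, not by their original row number.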

We won’t have the word_counter unless we train those models ourselves. I am using gensim’s Word2Vec model. Is there any way to get the mapping? stoi has a mapping, and I am passing that token, but I see that it is not using that mapping. For instance, id 1 is used for padding while in the word2vec it refers to the value in any particulars on how to resolve…

Hi Naresh - I saved two crucial artifacts as part of fasttext training in gensim and surely these techniques will apply to word2vec training in gensim as well.

First, I saved the word embeddings in the w2v text file format.

    # save w2v format since this is useful for PyTorch
    if save_w2v:
        w2v_out_filepath = os.path.join(save_dir, f'{file_name}_w2vformat.txt')
        model.wv.save_word2vec_format(w2v_out_filepath)
        print(f'Saved {w2v_out_filepath}')

Second, I saved the word frequency Counter dictionary that is required by torchtext.vocab.Vocab.

    # save word frequencies since this is useful for PyTorch
    counts = Counter(
        {word: entry.count
         for (word, entry) in model.wv.vocab.items()})
    freq_filepath = os.path.join(save_dir, f'{file_name}_word_freq.json')
    save_json(freq_filepath, counts)
    print(f'Saved {freq_filepath}')

Then I load these files into torchtext.vocab.Vectors and torchtext.vocab.Vocab objects, respectively, as described above, and it all comes together beautifully.
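The save_json and load_json helpers are not shown in this thread, so here is a minimal sketch of what they might look like, with a round trip of the frequency Counter through a temporary directory. The helper bodies, the file name, and the counts are assumptions for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def save_json(path, obj):
    """Write a JSON-serialisable object (a Counter is a dict) to path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(obj, handle, ensure_ascii=False)


def load_json(path):
    """Read a JSON file back into plain Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


counts = Counter({"hello": 5, "world": 3, "bye": 2})

with tempfile.TemporaryDirectory() as save_dir:
    freq_filepath = os.path.join(save_dir, "example_word_freq.json")
    save_json(freq_filepath, counts)
    # json.load returns a plain dict, so wrap it in Counter again.
    word_counter = Counter(load_json(freq_filepath))

print(word_counter == counts)  # True
```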

I hope that helps!