Torchtext: Loading Pretrained GloVe Vectors

I’m trying to learn how to load pretrained GloVe vectors using torchtext. I managed to get something working, but I’m confused about what it’s actually doing.

As an example I have something like this:

MyField.build_vocab(train_data, vectors='glove.6B.100d')

Then in my model I have 100-dimensional embeddings, and I load these weights with:

pretrained_embeddings = MyField.vocab.vectors
my_model.embedding.weight.data.copy_(pretrained_embeddings)
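
(For reference, both sides of that copy_ need matching shapes; with the same MyField and my_model names as above:)

print(MyField.vocab.vectors.shape)       # torch.Size([len(MyField.vocab), 100])
print(my_model.embedding.weight.shape)   # the embedding layer must be built with the same sizes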

My Question:
So it’s loading a pretrained embedding matrix that was trained on a vocabulary of some size. How does that work when my vocabulary has a different size and contains different words? My guess is that it copies the vectors for the words that are in my vocabulary and randomly initializes the rest. Is that what it’s doing?

The performance is actually worse for me when loading the pretrained GloVe vectors than when training the embedding from scratch. That’s weird, so there’s definitely something I’m doing wrong /:

It is common that some words in your training dataset are not included in the pre-trained vocabulary. To deal with this, you can set max_size and unk_init when calling build_vocab. max_size keeps only the most frequently occurring words in your vocab, and unk_init controls how vectors are initialized for words that have no pre-trained vector (instead of the default all-zeros), since those words occur rarely in the text and are less important. For example:

TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)

For details, see https://torchtext.readthedocs.io/en/latest/data.html?highlight=build_vocab#torchtext.data.Field.build_vocab

Also, remember to fix (freeze) these weights during training.
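
Putting it together, here is a minimal sketch of the whole recipe, assuming the legacy torchtext Field API and the TEXT / train_data / my_model names from this thread; zeroing the <unk> and <pad> rows is a common extra step, not something your original code requires:

import torch

# Build the vocab from the 25,000 most frequent words and attach GloVe vectors;
# words with no pre-trained vector are initialized by unk_init instead of zeros.
TEXT.build_vocab(train_data, max_size=25000,
                 vectors="glove.6B.100d",
                 unk_init=torch.Tensor.normal_)

# Copy the pretrained matrix into the model's embedding layer (shapes must match).
pretrained_embeddings = TEXT.vocab.vectors
my_model.embedding.weight.data.copy_(pretrained_embeddings)

# Common extra step: zero out the <unk> and <pad> rows.
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
my_model.embedding.weight.data[UNK_IDX] = torch.zeros(100)
my_model.embedding.weight.data[PAD_IDX] = torch.zeros(100)

# "Fix" the pretrained weights, i.e. keep them frozen during training.
my_model.embedding.weight.requires_grad = False

If you freeze the embeddings this way, make sure the optimizer is built only over parameters that still require gradients, e.g. filter(lambda p: p.requires_grad, my_model.parameters()).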
