RuntimeError: inconsistent tensor size while using pre-trained weights

I am trying to use pre-trained word embeddings in PyTorch, loaded from a pickle file (Wv.p). The pre-trained word2vec vocabulary contains 114044 words, while my dataset contains 426 unique words.

To use these embeddings, I load the pickle file and copy the weights with:
self.word_embedding.weight.data.copy_(torch.from_numpy(Wv))

but when I run it I get the following error:
RuntimeError: inconsistent tensor size, expected tensor [426 x 50] and src [114044 x 50] to have the same number of elements, but got 21300 and 5702200 elements respectively at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensorCopy.c:121

What am I doing wrong? The number of words in my dataset obviously won't be the same as in the pre-trained dataset.

The pretrained embedding was trained on a vocab of 114044 unique words.
But your embedding layer was initialised for a vocab of only 426 words, so the two weight tensors have different shapes.

I see three possible solutions…

  1. Initialise the embedding layer for a vocab of 114044 and make sure that the 426 words of your new dataset use the word indices from the 114044 word vocabulary.
  2. Extract the embedding data for those 426 words from the 114044 word pretrained embedding (see the sketch after this list).
  3. Train new embeddings from scratch.
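
For option 2, a minimal sketch could look like the following. It assumes `Wv` is the (114044 x 50) numpy array from your pickle file, `pretrained_vocab` is a dict mapping each pretrained word to its row index in `Wv`, and `my_vocab` is the list of your 426 dataset words; those last two names are placeholders for whatever your code actually uses.

```python
import numpy as np
import torch
import torch.nn as nn

# Assumed inputs (names are placeholders):
#   Wv               - numpy array of shape (114044, 50) from the pickle file
#   pretrained_vocab - dict: word -> row index in Wv
#   my_vocab         - list of the 426 words in your dataset

emb_dim = Wv.shape[1]  # 50

# Build a (426 x 50) matrix by copying the pretrained vector for each word.
subset = np.zeros((len(my_vocab), emb_dim), dtype=np.float32)
for i, word in enumerate(my_vocab):
    if word in pretrained_vocab:
        subset[i] = Wv[pretrained_vocab[word]]
    # words missing from the pretrained vocab keep a zero vector
    # (you could also initialise them randomly)

# Now the embedding layer and the source tensor have the same shape.
word_embedding = nn.Embedding(len(my_vocab), emb_dim)
word_embedding.weight.data.copy_(torch.from_numpy(subset))
```

The key point is that the tensor you copy into `word_embedding.weight` must have exactly the same shape as the layer you created, i.e. (426 x 50) here.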

Thank you for your answer. The concept is clear to me now.