How to use custom word2vec embedding in torchtext.vocab Vectors

Sirui_Li · July 10, 2019, 3:07am

I trained a custom word2vec embedding file named "word2vec.txt" and I would like to use it in TEXT.build_vocab(train_data, vectors=Vectors("word2vec.txt")), where train_data is my training data as a torchtext Dataset.

But I got this error:

Vector for token b'\xc2\xa0' has 301 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

I have checked my embedding file, and every vector has 300 dimensions. If I swap in a pre-trained GloVe file instead, it works without any issue.

Sirui_Li · July 10, 2019, 3:49am

I resolved it by opening the embedding file in "rb" mode, searching for the non-breaking space (b'\xc2\xa0'), and deleting the occurrences. I think the more effective fix is to preprocess the training data so that this character never ends up in the word2vec file.
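
For anyone hitting the same error, here is a minimal sketch of that cleanup. It is not the exact script used above. The function name clean_embedding_file and the output file name are made up for illustration, and it assumes a space-separated text file with one token per line. Instead of editing bytes in place, it drops every row whose token contains the non-breaking space or whose field count does not match the expected dimension.

```python
NBSP = "\u00a0".encode("utf-8")  # the two bytes b'\xc2\xa0'


def clean_embedding_file(src_path, dst_path, dim=300):
    """Copy src_path to dst_path, keeping only well-formed rows.

    Hypothetical helper: a row is kept when it has a non-empty token
    without a non-breaking space, followed by exactly `dim` values.
    Returns (rows_kept, rows_dropped).
    """
    kept = dropped = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # bytes.rstrip() strips ASCII whitespace only, so the
            # non-breaking space bytes survive and can be detected.
            fields = line.rstrip().split(b" ")
            token, values = fields[0], fields[1:]
            if not token or NBSP in token or len(values) != dim:
                # This also drops a "<vocab_size> <dim>" header line,
                # if the file has one.
                dropped += 1
                continue
            dst.write(b" ".join(fields) + b"\n")
            kept += 1
    return kept, dropped
```

Calling clean_embedding_file("word2vec.txt", "word2vec.clean.txt") and then passing Vectors("word2vec.clean.txt") to build_vocab should avoid the dimension mismatch, assuming the bad rows were the only problem. Checking the returned dropped count is a quick way to see how many rows were affected.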