How to use torchtext to build a vocabulary from a binary file such as 'GoogleNews-vectors-negative300.bin'?

I can use a word vector model in text format as follows:

import os
from torchtext.vocab import Vectors

# Create the cache directory torchtext uses for processed vectors
if not os.path.exists('.vector_cache'):
    os.mkdir('.vector_cache')
vectors = Vectors(name='myvector/glove/glove.6B.200d.txt')
TEXT.build_vocab(train, vectors=vectors)

However, when I switch to the binary format, such as GoogleNews-vectors-negative300.bin, I get an error: could not convert string to float.
The code is almost the same as above:

if not os.path.exists('.vector_cache'):
    os.mkdir('.vector_cache')
vectors = Vectors(name='GoogleNews-vectors-negative300.bin')
TEXT.build_vocab(train, vectors=vectors)

So, how can I use a word vector model in binary format to build a vocab?
In addition, should we use the vocabulary of the pre-trained model directly, build a vocabulary from the training set, or build one from the training set plus the test set? I am very confused about this.

Any help would be appreciated!

I have faced the same issue with .bin files. What I did was convert the .bin file to text format, and then it worked.
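
For reference, here is a minimal conversion sketch using gensim, assuming the file is in the standard word2vec binary format (the output filename is just an example; the text format starts with a "vocab_size dim" header line, which torchtext should skip, but you can delete it if it causes trouble):

from gensim.models import KeyedVectors

# Load the binary word2vec file (binary=True is required for .bin files)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Re-save in plain-text word2vec format, which torchtext's Vectors can read
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)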

When it comes to building custom embeddings, I think it depends on a few factors: how much data do you have to generate good embeddings? Do the available embeddings not serve your purpose?

Unless your data is very specific, e.g., medical or related to some other specialized domain whose vocabulary the available embeddings don't cover, you probably don't need custom embeddings; if it is, you can train your own using gensim (which is very simple), as sketched below.
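
A minimal training sketch, assuming gensim 4.x (where the dimension parameter is vector_size; older versions call it size) and a hypothetical tokenized corpus:

from gensim.models import Word2Vec

# Hypothetical corpus: each sentence is a list of tokens
sentences = [
    ['the', 'patient', 'was', 'given', 'metformin'],
    ['metformin', 'lowers', 'blood', 'glucose'],
]

# Train a small word2vec model; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Save the vectors in text format so torchtext's Vectors can load them
model.wv.save_word2vec_format('my_embeddings.txt', binary=False)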

However, I would suggest playing around with GloVe or fastText embeddings using a few of the custom words from your dataset and inspecting the outputs before deciding between custom and available embeddings.
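
For example, here is a quick coverage check (the words below are hypothetical; substitute terms from your own dataset) using the same torchtext Vectors object as above:

from torchtext.vocab import Vectors

vectors = Vectors(name='myvector/glove/glove.6B.200d.txt')

# See whether dataset-specific words exist in the pretrained vocabulary
for word in ['protein', 'metformin', 'cardiomyopathy']:
    print(word, 'in vocab:', word in vectors.stoi)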
