Vocab + Vectors: questions on simultaneous usage in pipeline

  • What is the format of a file with cached pre-trained vectors for the Vectors object? (I will do the pre-training myself using fasttext.) Is it the same format fasttext uses?

Example from fasttext docs:

218316 100
the -0.10363 -0.063669 0.032436 -0.040798 0.53749 0.00097867 0.10083 0.24829 ...
of -0.0083724 0.0059414 -0.046618 -0.072735 0.83007 0.038895 -0.13634 0.60063 ...
one 0.32731 0.044409 -0.46484 0.14716 0.7431 0.24684 -0.11301 0.51721 0.73262 ...
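For reference, a minimal sketch of parsing that format in plain Python (the function name and the line-based interface are my own, not any library's API): the first line is "<vocab_size> <dim>", and each following line is a token followed by its float components.

```python
def parse_vec_lines(lines):
    """Parse the fasttext .vec format shown above.

    First line: "<vocab_size> <dim>".
    Each following line: "<token> <float> <float> ...".
    Returns (token -> vector dict, dimension).
    """
    it = iter(lines)
    n_words, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim
```
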
  • If the format is as shown above, should I just skip the vocab stage and put tokens (after tokenization) straight through the Vocab object to get token vectors? Or should I convert tokens to ints before the fasttext pre-training and then use ints instead of tokens in the vectors file? That way I would make use of the Vocab object.

What is the intended way to do this? What was the intended use case for the Vectors object? Is it to skip Vocab entirely, or do you expect Vectors objects with ints as tokens?
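To make the two alternatives I am asking about concrete, here is a toy sketch in plain Python (all names and values are illustrative, not the library's actual API):

```python
tokens = ["the", "of"]

# Option A: no int vocab, vectors keyed directly by token string.
vectors_by_token = {"the": [0.1, 0.2], "of": [0.3, 0.4]}  # dummy 2-d vectors
emb_a = [vectors_by_token[t] for t in tokens]

# Option B: go through a vocab first (token -> int), then index a
# vector table whose row i holds the vector for token id i.
stoi = {"the": 0, "of": 1}          # vocab mapping
table = [[0.1, 0.2], [0.3, 0.4]]    # rows ordered by token id
emb_b = [table[stoi[t]] for t in tokens]

assert emb_a == emb_b  # both routes yield the same embeddings
```
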

Thank you!