For reference, see the torchtext.vocab page of the Torchtext 0.13.0 documentation.
build_vocab_from_iterator() is very quick at getting tokens and indices into a Vocab object when fed a list of words or iterator of word lists.
But what if I need to add another large list of words to that Vocab object? I’ve used
vocab.append_token(possibly_new_token), but it seems to accept only one token per call, so it has to be driven from a loop, which is very slow for a large list of words. Is there a way to either:
a. Add a new list of tokens (ignoring any that already exist), or
b. Combine two vocab objects into one, while ignoring duplicates?
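
For option (a), a minimal sketch of the loop-based approach is below. It assumes the vocab object supports `token in vocab` and an `append_token(token)` method that adds one token at the end, as torchtext's `Vocab` does in the 0.13 docs. The helper names `tokens_to_add` and `extend_vocab` are hypothetical. It filters duplicates in plain Python first so that only genuinely new tokens reach `append_token`; whether that is fast enough for a very large list has not been measured here.

```python
from typing import Iterable, List


def tokens_to_add(vocab, candidates: Iterable[str]) -> List[str]:
    """Return the candidates missing from `vocab`, de-duplicated, in first-seen order."""
    seen = set()
    missing = []
    for token in candidates:
        # `token not in vocab` relies on the vocab object defining __contains__.
        if token not in seen and token not in vocab:
            seen.add(token)
            missing.append(token)
    return missing


def extend_vocab(vocab, candidates: Iterable[str]) -> int:
    """Append every unseen token to `vocab` in place; return how many were added."""
    missing = tokens_to_add(vocab, candidates)
    for token in missing:
        # Assumed behaviour: the token goes to the end, so existing indices stay put.
        vocab.append_token(token)
    return len(missing)
```

Because new tokens are only ever appended, indices already used by an embedding layer should remain valid; only the embedding matrix would need extra rows.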
Would it be possible to get all the words from a built vocabulary into a list, merge that list with the new list, and then call
build_vocab_from_iterator() on the merged list?
I’m not seeing a vocab method to get the list out or manipulate it.
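
A hedged sketch of the "get the list out, merge, rebuild" idea follows. It assumes `Vocab.get_itos()` returns the token list in index order and that the `torchtext.vocab.vocab()` factory builds a Vocab from an ordered dict while respecting insertion order, which is how the 0.13 docs describe them. `merge_token_lists` and `rebuild_vocab` are hypothetical helper names. The factory is used instead of build_vocab_from_iterator() because the latter orders tokens by frequency, which could reshuffle existing indices.

```python
from collections import OrderedDict
from typing import Iterable, List


def merge_token_lists(existing: Iterable[str], new: Iterable[str]) -> List[str]:
    """Concatenate two token sequences, dropping duplicates, keeping first-seen order."""
    # dict keys preserve insertion order (Python 3.7+), so this is an ordered de-dup.
    return list(dict.fromkeys([*existing, *new]))


def rebuild_vocab(old_vocab, new_tokens: Iterable[str]):
    """Build a fresh Vocab holding the old tokens first, then any unseen new ones."""
    # Imported lazily so merge_token_lists stays usable without torchtext installed.
    from torchtext.vocab import vocab as make_vocab

    merged = merge_token_lists(old_vocab.get_itos(), new_tokens)
    # Every token gets a dummy count of 1, so min_freq=1 keeps all of them.
    return make_vocab(OrderedDict((token, 1) for token in merged), min_freq=1)
```

Combining two vocab objects (option b) would then be `rebuild_vocab(vocab_a, vocab_b.get_itos())`. The default index is probably not carried over to the new object, so `set_default_index` may need to be called again.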
Currently, I’m just making do with the smallest set of pre-trained GloVe embedding vectors, which is still much larger than I’d like.
I am looking for a more memory-efficient embedding system. Bonus points if it can leverage phonemes first and connect them to a reduced embedding vector. I believe such a system would have a more intuitive “understanding” of language, given that written language was derived from oral language.