Quick Way to Append Batch Tokens to Vocab object

J_Johnson · January 11, 2023, 5:42am

Refer to here: torchtext.vocab — Torchtext 0.13.0 documentation

build_vocab_from_iterator() is very quick at getting tokens and indices into a Vocab object when fed a list of words or iterator of word lists.

But what if I need to add another large list of words to that vocab object? I’ve used vocab.append_token(possibly_new_token), but that seems to accept only a single token per call, so it has to be run in a loop, which is very slow for a large list of words. Is there a way to either:

a. Add a new list of tokens (and ignore existing ones), or
b. Combine two vocab objects into one, while ignoring duplicates?

dreidizzle · January 16, 2023, 5:07pm

Would it be possible to get all the words from a built vocabulary into a list, merge that list with the new list, and then call build_vocab_from_iterator() on the merged list?
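
A rough sketch of that merge step, in case it helps. It uses plain Python lists so it runs without torchtext. The helper name merge_token_lists and the sample tokens are made up for illustration. I'm assuming the existing tokens can be pulled out in index order with Vocab.get_itos(), and that you want old tokens to keep their indices, with unseen tokens appended at the end.

```python
from typing import Iterable, List


def merge_token_lists(existing: List[str], new_tokens: Iterable[str]) -> List[str]:
    """Return the existing tokens followed by unseen new tokens, in first-seen order."""
    seen = set(existing)      # set lookup keeps each membership check O(1)
    merged = list(existing)   # copy, so existing tokens keep their positions
    for tok in new_tokens:
        if tok not in seen:   # skip tokens already present, including repeats in new_tokens
            seen.add(tok)
            merged.append(tok)
    return merged


# Stand-in for old_vocab.get_itos(); hypothetical sample data.
old_itos = ["<unk>", "the", "cat", "sat"]
new_words = ["the", "dog", "sat", "down", "dog"]

merged = merge_token_lists(old_itos, new_words)
print(merged)

# Possible rebuild with torchtext (assumption, not run here):
#   from collections import OrderedDict
#   from torchtext.vocab import vocab as make_vocab
#   new_vocab = make_vocab(OrderedDict((t, 1) for t in merged), min_freq=1)
```

One caveat: as far as I know, build_vocab_from_iterator() orders tokens by frequency, so feeding it the merged list would likely reshuffle the indices. If existing indices matter (for example, for an already-trained embedding layer), building from an ordered dict as in the commented lines may be the safer route.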

J_Johnson · January 17, 2023, 3:39am

I’m not seeing a vocab method to get the list out or manipulate it.

Currently, just making do with the smallest pre-trained GloVe embedding vector, which is still much larger than I’d like.

Am looking for a more memory-efficient embedding system. Bonus points if it can leverage phonemes first and connect those to a reduced embedding vector. I believe such a system would have a more intuitive “understanding” of language, given how written language was derived from oral language.