How to get the vocab quickly from a huge list of texts?

Hi!

I have been following the tutorial Language Translation With Torchtext, which uses this function to build a vocab object:

from collections import Counter
import io
from torchtext.vocab import vocab

def build_vocab(filepath, tokenizer):
  counter = Counter()
  with io.open(filepath, encoding="utf8") as f:
    for string_ in f:
      # count token frequencies line by line
      counter.update(tokenizer(string_))
  return vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

(By the way, I have changed the Vocab from the tutorial to vocab, which is up to date with the current version of torchtext.)
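
For context, this is roughly how I call it; the file name here is just a placeholder for my own data:

from torchtext.data.utils import get_tokenizer

en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
en_vocab = build_vocab('my_sentences.txt', en_tokenizer)  # placeholder path
print(len(en_vocab))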

So I tried it on my own data, a text file containing about 10M sentences, and it was estimated to finish after 3 hours (I’m using the CPU from Colab). I was wondering if there is a more efficient way to do this.

Thank you!

I have changed my function to:

from torchtext.vocab import build_vocab_from_iterator

def build_vocab(data_iter, tokenizer):
    def yield_tokens(data_iter, tokenizer):
        for data_sample in data_iter:
            yield tokenizer(data_sample)

    vocab = build_vocab_from_iterator(
        yield_tokens(data_iter, tokenizer),
        min_freq = 1,
        specials = ["<unk>", "<pad>", "<bos>", "<eos>"],
        special_first = True
    )
    # default to the <unk> index (0, since the specials come first)
    vocab.set_default_index(0)
    return vocab

I have also changed the tokenizer to:

from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy", "en_core_web_sm")

And this reduced the time needed to process my data to about 15 minutes.
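
Wired together it looks roughly like this (the file name is again just a placeholder), passing the open file handle as the iterator:

import io
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("spacy", "en_core_web_sm")
with io.open("my_sentences.txt", encoding="utf8") as f:  # placeholder path
    vocab = build_vocab(f, tokenizer)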

I’m sure this can be optimized even further with tokenizers from Hugging Face that support batching, and maybe with a vocab object that can be built in batches rather than from a plain iterator, but I haven’t checked that out yet.
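
As a rough, untested sketch of what I mean with the Hugging Face tokenizers library (note this uses whitespace pre-tokenization rather than spaCy, and the file name is a placeholder):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# build a word-level vocab directly from the file; the training loop runs in parallel Rust code
tokenizer = Tokenizer(models.WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordLevelTrainer(
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],
    min_frequency=1,
)
tokenizer.train(files=["my_sentences.txt"], trainer=trainer)  # placeholder path
word2idx = tokenizer.get_vocab()  # dict mapping token -> index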