How (specifically) does torchtext.data.Field's build_vocab() assign integers to tokens when the corpus has more unique words than the vocab's max_size?

For example, suppose I have a TEXT Field object and do TEXT.build_vocab(train_set, max_size=50000). My understanding is that build_vocab() finds all unique words in the training set (which has been tokenized by the tokenizer passed to the Field object) and assigns each of them a unique integer. So in this case, my vocabulary will consist of 49,998 words mapped to unique integers. The other two words are <unk> (index 0) and <pad> (index 1).
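For concreteness, here is the kind of call I mean, as a toy sketch against the legacy torchtext API (torchtext.data before 0.9); the token lists stand in for my real train_set, since I believe build_vocab also accepts plain iterables of token lists:

    from torchtext.data import Field

    TEXT = Field(tokenize=str.split)
    # in my case: TEXT.build_vocab(train_set, max_size=50000)
    TEXT.build_vocab([["the", "cat", "sat"], ["the", "dog"]], max_size=50000)

    print(TEXT.vocab.itos)         # ['<unk>', '<pad>', 'the', 'cat', 'dog', 'sat']
    print(TEXT.vocab.stoi["the"])  # 2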

Question
What if my training set has 100,000 unique words? Does build_vocab() assign unique integers only to the first 49,998 unique words it encounters (plus <unk> and <pad>), or does it also consider how frequently each word occurs? I see a counter parameter in the Vocab docs, but I'm not sure how it is used (I also see min_freq, which makes sense to me).

Great question!
From what I see in these two files, field.py and vocab.py, it's based on frequency.

    for word, freq in words_and_frequencies:
        if freq < min_freq or len(self.itos) == max_size:
            break
        self.itos.append(word)

The loop above caps itos at max_size: it stops adding words once the vocabulary is full, or once a word's frequency drops below min_freq.
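Here's a toy rerun of that loop outside torchtext (made-up frequencies, a requested max_size of 3) to show the cutoff in isolation:

    from collections import Counter

    counter = Counter({"the": 10, "cat": 5, "sat": 2, "on": 1})
    itos = ["<unk>", "<pad>"]
    max_size = 3 + len(itos)  # the requested max_size, extended by the specials
    min_freq = 2

    # most_common() already yields (word, freq) pairs sorted most-to-least frequent
    for word, freq in counter.most_common():
        if freq < min_freq or len(itos) == max_size:
            break
        itos.append(word)

    print(itos)  # ['<unk>', '<pad>', 'the', 'cat', 'sat'] -- 'on' falls below min_freq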

Thank you for pointing me to this code! Yes, looking at it, selection is based on frequency: the words are sorted by frequency (most frequent to least frequent) in the first few lines below.

    # sort by frequency, then alphabetically
    words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0])
    words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True)

    for word, freq in words_and_frequencies:
        if freq < min_freq or len(self.itos) == max_size:
            break
        self.itos.append(word)
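The two-pass sort works because Python's list sort is stable: the first pass orders ties alphabetically, and the second pass (by frequency, descending) preserves that order within each frequency. A quick illustration:

    from collections import Counter

    counter = Counter(["b", "a", "a", "c", "b"])  # a: 2, b: 2, c: 1

    words_and_frequencies = sorted(counter.items(), key=lambda tup: tup[0])
    words_and_frequencies.sort(key=lambda tup: tup[1], reverse=True)

    print(words_and_frequencies)  # [('a', 2), ('b', 2), ('c', 1)]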

Edit: I think my original understanding was slightly off. If you set max_size=50000, then your vocab size should be 50002: your 50000 most frequently occurring unique terms, plus <unk> and <pad>.

From vocab.py:

    if specials_first:
        self.itos = list(specials)
        # only extend max size if specials are prepended
        max_size = None if max_size is None else max_size + len(specials)
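So the specials are prepended first, and max_size is bumped by len(specials) before the frequency loop runs. You can confirm the final size directly against the Vocab class (a sketch using the legacy API, torchtext.vocab.Vocab before 0.9 and torchtext.legacy.vocab.Vocab after, with made-up frequencies):

    from collections import Counter
    from torchtext.vocab import Vocab

    counter = Counter({f"word{i}": i + 1 for i in range(10)})  # 10 unique words
    vocab = Vocab(counter, max_size=5, specials=["<unk>", "<pad>"])

    print(len(vocab))      # 7 == max_size + len(specials)
    print(vocab.itos[:2])  # ['<unk>', '<pad>']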