How to save and load "torchtext.data.Field.build_vocab" result

Hello,

How can I save the vocabulary produced by "build_vocab" and load it again later?

I would like to bump this. Using torch.load gives an error.

This seems to be a decent workaround:

def save_vocab(vocab, path):
    # Write one "index<TAB>token" pair per line.
    with open(path, 'w', encoding='utf-8') as f:
        for token, index in vocab.stoi.items():
            f.write(f'{index}\t{token}\n')

Then, to read it back:

def read_vocab(path):
    vocab = dict()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip the trailing newline so it does not end up in the token.
            index, token = line.rstrip('\n').split('\t', 1)
            vocab[token] = int(index)
    return vocab
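
A quick way to check the two helpers is a round trip through a temporary file. The sketch below does not need torchtext: fake_vocab is a made-up stand-in that only has the stoi attribute, and the file name vocab.tsv is arbitrary. This version of read_vocab strips the trailing newline from each line before splitting, since otherwise the newline stays attached to the token.

```python
import os
import tempfile
from types import SimpleNamespace


def save_vocab(vocab, path):
    # Write one "index<TAB>token" pair per line.
    with open(path, 'w', encoding='utf-8') as f:
        for token, index in vocab.stoi.items():
            f.write(f'{index}\t{token}\n')


def read_vocab(path):
    vocab = dict()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Drop the newline, then split only on the first tab.
            index, token = line.rstrip('\n').split('\t', 1)
            vocab[token] = int(index)
    return vocab


# Hypothetical stand-in for words.vocab: only the stoi mapping is needed here.
fake_vocab = SimpleNamespace(stoi={'<unk>': 0, '<pad>': 1, 'hello': 2, 'world': 3})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab.tsv')
    save_vocab(fake_vocab, path)
    loaded = read_vocab(path)

# True when every token comes back with its original index.
round_trip_ok = loaded == fake_vocab.stoi
print(round_trip_ok)
```

Note that this format cannot represent tokens that themselves contain a newline; if your tokenizer can produce those, a different serialization would be needed.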

So you first define a Field object, for example:
words = Field(**args)

Then after using words.build_vocab(), call:
save_vocab(words.vocab, PATH)

And for loading:
words.vocab = read_vocab(PATH)

So, finally, you would have:

def save_vocab(vocab, path):
    # Write one "index<TAB>token" pair per line.
    with open(path, 'w', encoding='utf-8') as f:
        for token, index in vocab.stoi.items():
            f.write(f'{index}\t{token}\n')

def read_vocab(path):
    vocab = dict()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip the trailing newline so it does not end up in the token.
            index, token = line.rstrip('\n').split('\t', 1)
            vocab[token] = int(index)
    return vocab

words = Field(**args)
words.build_vocab(dataset, dataset, dataset, ...)
save_vocab(words.vocab, PATH)

words_loaded = Field(**args)
words_loaded.vocab = read_vocab(PATH)
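
One caveat: read_vocab returns a plain dict, while code that works with a Field's vocab (numericalization, for instance) usually looks tokens up through vocab.stoi and maps indices back through vocab.itos. If you hit an AttributeError after loading, a small wrapper may help. The sketch below is only an illustration: LoadedVocab is a hypothetical name, not a torchtext class, and it assumes the unknown token is '<unk>' and that the saved indices are unique and contiguous from 0.

```python
from collections import defaultdict


class LoadedVocab:
    """Hypothetical stand-in exposing stoi/itos the way a built vocab does."""

    def __init__(self, stoi, unk_token='<unk>'):
        # Assumption: unknown tokens should map to the index of unk_token
        # (or 0 if that token is not in the saved mapping).
        unk_index = stoi.get(unk_token, 0)
        self.stoi = defaultdict(lambda: unk_index, stoi)
        # itos[i] is the token whose saved index is i.
        self.itos = [token for token, _ in sorted(stoi.items(), key=lambda kv: kv[1])]

    def __len__(self):
        return len(self.itos)


# Example with a mapping shaped like the one read_vocab returns.
example = LoadedVocab({'<unk>': 0, '<pad>': 1, 'hello': 2})
print(example.stoi['hello'], example.stoi['never-seen'], len(example))
```

With this in place, the last line of the listing would become words_loaded.vocab = LoadedVocab(read_vocab(PATH)). Anything else stored on the original vocab, such as frequencies or pretrained vectors, is not saved by this approach and would have to be handled separately.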