Saving and loading vocabulary

Train time:

import torch
from torchtext import vocab

vectors = vocab.FastText()
self.text.build_vocab(train_data, vectors=vectors, max_size=35000, unk_init=torch.Tensor.normal_)
torch.save(self.text, "vocab")

Test time:

self.text = torch.load("vocab")

I get this error though when running it:

arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
AttributeError: 'Field' object has no attribute 'vocab'

What am I doing incorrectly here? Any ideas?

What type does self.text have? Could it be that you’ve saved a class instance, which might be missing some internals after loading it?

Hi ptrblck,
Thanks for responding! How can I check that?
It is defined like this:

self.text = data.Field(
            tokenize=tokenizer,
            lower=True,
            include_lengths=True,
            preprocessing=generate_n_grams,
        )

After loading it, dir(self.text) gives:

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'batch_first', 'build_vocab', 'dtype', 'dtypes', 'eos_token', 'fix_length', 'ignore', 'include_lengths', 'init_token', 'is_target', 'lower', 'numericalize', 'pad', 'pad_first', 'pad_token', 'postprocessing', 'preprocess', 'preprocessing', 'process', 'sequential', 'stop_words', 'tokenize', 'tokenizer_args', 'truncate_first', 'unk_token', 'use_vocab', 'vocab', 'vocab_cls']

It does have vocab, so it is very confusing.
If I run something like self.text.vocab.freqs.most_common(20), it seems to work fine.
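
For reference, this is roughly how I checked it (a quick sketch):

loaded = torch.load("vocab")
print(type(loaded))                       # shows the torchtext Field class
print(hasattr(loaded, "vocab"))           # True here
print(loaded.vocab.freqs.most_common(5))  # works, so the vocab survived saving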

I’m unfortunately not really familiar with torchtext, but would generally recommend storing only the states/tensors instead of the class instances directly.
Would it work if you only store the vocabulary (assuming it’s a tensor/dict/mapping of some kind) and recreate the data.Field object with it?
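
For example, something along these lines might work (an untested sketch, reusing the data.Field arguments from above; "vocab.pt" is just an arbitrary file name):

import torch
from torchtext import data

# Train time: after build_vocab, save only the Vocab object, not the Field.
torch.save(self.text.vocab, "vocab.pt")

# Test time: recreate the Field with the same arguments and reattach the vocab.
self.text = data.Field(
    tokenize=tokenizer,
    lower=True,
    include_lengths=True,
    preprocessing=generate_n_grams,
)
self.text.vocab = torch.load("vocab.pt")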

Thanks for that suggestion; I am not sure how to do that in this case, though.

I load word vectors like this:

vectors = vocab.FastText()
self.text.build_vocab(train_data, vectors=vectors, max_size=vocab_size, unk_init=torch.Tensor.normal_)

The model uses an embedding layer:

nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
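
The vectors from build_vocab would typically be copied into that layer along these lines (a sketch, assuming vocab_size == len(self.text.vocab) and embedding_dim matches the vector dimension):

import torch.nn as nn

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
# Copy the vectors produced by build_vocab into the layer's weight matrix.
embedding.weight.data.copy_(self.text.vocab.vectors)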

In the worst case, I can use the train_data again to build the vocabulary as well, which is fine, but then I will lose the modified word vectors. If I save them (the vectors) at the start using torch.save, I suspect that would not help either, since the changes happen in the embedding layer, not in the field (self.text).
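
One way around that might be to save the Vocab (which carries its .vectors tensor) together with the model’s state_dict, so that neither the token mapping nor the trained embedding weights are lost (a sketch; model is an assumed variable name):

# Train time: the Vocab object carries stoi/itos and the vectors tensor.
torch.save(self.text.vocab, "vocab.pt")
torch.save(model.state_dict(), "model.pt")  # trained embedding weights live here

# Test time: reattach the vocab and restore the trained weights.
self.text.vocab = torch.load("vocab.pt")
model.load_state_dict(torch.load("model.pt"))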

When you save the file, shouldn’t it have a ‘*.pkl’ at the end? I.e. ‘vocab.pkl’.

The save module might not recognize what you are trying to do with “vocab” without having the pickle file extension.
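
For what it’s worth, torch.save pickles the object regardless of the file name, so an extension like ‘.pkl’ or ‘.pt’ is only a convention, e.g.:

torch.save(self.text.vocab, "vocab.pkl")  # the extension is cosmetic for torch.save
self.text.vocab = torch.load("vocab.pkl")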