Torchtext - can we save the processed dataset and fields?

yifanwang · December 17, 2017, 7:36pm

Hi,

I’m currently using torchtext, but I found that creating a Dataset object and calling Field’s build_vocab takes quite a long time, especially when the tokenizer is complicated. However, I could not save them with pickle. Is there a way to save the processed dataset and fields, so that we can speed up data loading?

Thanks.

Johan_pow · February 15, 2019, 11:22am

Hi, did you find a good way to do it? So far I have found this way of saving the Field with dill:

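import dill
from torchtext import data
# `tokenizer` is your own tokenization function, defined elsewhere.
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=200, batch_first=True)
with open("model/TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)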

from https://stackoverflow.com/questions/53421999/how-to-save-torchtext-dataset
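
For the original question (skipping the slow tokenization and vocab building on later runs), a "load the cache if it exists, otherwise build and save" pattern may help. The sketch below is only an illustration under some assumptions: it does not use torchtext at all, `slow_tokenize`, `build_processed` and `load_or_build` are made-up names, and a plain list of token lists plus a `Counter` stand in for the Dataset examples and the vocabulary counts. Because it stores only plain data, the standard pickle module is enough here. An object that holds a lambda tokenizer generally cannot be pickled that way, which is where dill comes in.

```python
import os
import pickle
import tempfile
from collections import Counter


def slow_tokenize(text):
    # Hypothetical stand-in for an expensive tokenizer.
    return text.lower().split()


def build_processed(raw_texts):
    # The expensive step: tokenize every example and count tokens,
    # which is the information a vocabulary is built from.
    examples = [slow_tokenize(t) for t in raw_texts]
    counts = Counter(tok for ex in examples for tok in ex)
    return {"examples": examples, "counts": counts}


def load_or_build(raw_texts, cache_path):
    # Return (processed_data, loaded_from_cache).
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    processed = build_processed(raw_texts)
    with open(cache_path, "wb") as f:
        pickle.dump(processed, f)
    return processed, False


# Usage: the first call builds and writes the cache, the second one reads it.
cache_file = os.path.join(tempfile.mkdtemp(), "processed.pkl")
texts = ["Hello world", "hello there"]
first, first_from_cache = load_or_build(texts, cache_file)
second, second_from_cache = load_or_build(texts, cache_file)
```

Note that the cache is keyed only by file name in this sketch, so it would need to be deleted by hand whenever the raw data or the tokenizer changes.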

MFajcik1 · April 18, 2019, 5:17pm

But is there currently any way to save the dataset itself?

cb13e917983afd9ad4e7 · June 11, 2020, 9:49am

PyTorch now provides an interface for saving and loading the processed vocab. You can call torch.save() and torch.load() directly on the TEXT Field object.