Torchtext - can we save the processed dataset and fields?

Hi,

I’m currently using torchtext, but I found that creating a Dataset object and calling Field’s build_vocab takes quite a long time, especially when the tokenizer is complicated. However, I failed to save them with pickle. Is there a way to save the processed dataset and fields, so that we can speed up data loading?

Thanks.

Hi, did you find a good way to do it? I found this approach, however:

import dill
from torchtext import data

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=200, batch_first=True)
with open("model/TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)

from https://stackoverflow.com/questions/53421999/how-to-save-torchtext-dataset
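
For context on why dill helps where pickle did not: one common cause (an assumption here, since the error message was not posted) is that the tokenizer is a lambda or a nested function, which the standard pickle module cannot serialize. The sketch below uses only the standard library, and the two tokenizers are hypothetical stand-ins, not torchtext objects.

```python
import functools
import pickle
import re


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical tokenizers for illustration only.
lambda_tokenizer = lambda text: text.split()              # anonymous function
partial_tokenizer = functools.partial(re.split, r"\s+")   # built from a module-level function

lambda_ok = can_pickle(lambda_tokenizer)
partial_ok = can_pickle(partial_tokenizer)

# The partial survives a round trip and still tokenizes.
restored_tokenizer = pickle.loads(pickle.dumps(partial_tokenizer))

print("lambda picklable:", lambda_ok)
print("partial picklable:", partial_ok)
print(restored_tokenizer("a b  c"))
```

If the tokenizer can be written as a module-level function (or a partial of one), plain pickle may be enough; otherwise dill, as in the snippet above, is the usual workaround.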

But is there currently a way to save the dataset itself?

PyTorch now provides an interface to save and load the processed vocab. You can call torch.save() and torch.load() directly on the TEXT Field object.
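
A minimal sketch of that round trip, assuming torch is installed. To keep it runnable without any particular torchtext version, it saves a plain dict as a stand-in for the built vocab; the dict contents and the file name are made up for illustration.

```python
import os
import tempfile

import torch

# Stand-in for a processed vocabulary (token -> index).
# A real Field or vocab object would be passed to torch.save() the same way.
vocab = {"<unk>": 0, "<pad>": 1, "hello": 2, "world": 3}

# Hypothetical cache location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "vocab.pt")

torch.save(vocab, path)      # serialize once, after the slow preprocessing
restored = torch.load(path)  # later runs load the cached result instead

print(restored == vocab)
```

Two caveats: torch.save() relies on pickle by default, so a Field whose tokenizer is a lambda may still fail to serialize, and recent PyTorch releases restrict what torch.load() will unpickle by default, so loading a full Field object may require passing weights_only=False.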