I’m currently using torchtext, but I found that creating a Dataset object and calling a Field’s build_vocab takes quite a long time, especially when the tokenizer is complicated. However, I failed to save them with pickle. Is there a way to save the processed dataset and fields, so that data loading can be sped up?
Hi, did you find a good way to do it? I found the following approach:

import dill
from torchtext import data

TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=200, batch_first=True)

# Serialize the Field with dill, which (unlike stdlib pickle) can handle
# custom tokenizers such as lambdas
with open("model/TEXT.Field", "wb") as f:
    dill.dump(TEXT, f)
Currently, PyTorch provides an interface to save/load the processed vocab: you can use torch.save() and torch.load() directly on the TEXT.Field object.
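As a sketch of that route (torch.save pickles arbitrary Python objects, not just tensors; here a plain dict stands in for the Field/vocab object, since the pattern is the same):

```python
import torch

# torch.save/torch.load round-trip arbitrary picklable objects to disk,
# so a built Field (vocab included) can be saved and restored the same way.
vocab = {"<unk>": 0, "<pad>": 1, "hello": 2}  # stand-in for the TEXT.Field object
torch.save(vocab, "vocab.pt")
restored = torch.load("vocab.pt")
print(restored["hello"])  # → 2
```

Note that torch.save uses pickle under the hood, so the same lambda-tokenizer caveat applies as with plain pickle.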