How can I construct a vocabulary from the train and test datasets?

Alston · May 10, 2021, 8:32am

I want to build a vocabulary from my training and test datasets using torchtext. Of course, I can do it as follows:

TEXT.build_vocab(train, test)

where TEXT is a Field object, and train and test are Dataset objects. But I do not have enough memory to load the training and test datasets at the same time. When I tried this instead:

TEXT.build_vocab(train)
del train
TEXT.build_vocab(test)
del test

it only builds the vocab from the test data.
How can I build the vocab in two steps, so that I can release the memory for each dataset once its part of the vocab has been created?

mmg · June 21, 2021, 3:56am

I am not sure why you are building the vocab from the test data. Only the training data should be used to build the vocabulary. Any words present in the test data but missing from the training data should be mapped to the unknown token.
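
A minimal sketch of that idea in plain Python, not torchtext's own API: the helper names (count_tokens, build_vocab, numericalize) and the toy data are made up for illustration, and the special tokens simply mirror torchtext's default names. The vocab is built from the training tokens only, and any test token that is not in it falls back to the unknown index.

```python
from collections import Counter

# Special tokens; the names mirror torchtext's defaults (an assumption here).
UNK, PAD = "<unk>", "<pad>"


def count_tokens(examples, counter=None):
    """Add token counts from an iterable of token lists to a Counter."""
    counter = Counter() if counter is None else counter
    for tokens in examples:
        counter.update(tokens)
    return counter


def build_vocab(counter, min_freq=1):
    """Map each sufficiently frequent token to an integer index."""
    itos = [UNK, PAD]
    # Sort by descending frequency, then alphabetically, for a stable order.
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(token)
    return {token: index for index, token in enumerate(itos)}


def numericalize(tokens, stoi):
    """Convert tokens to indices, sending unseen tokens to the <unk> index."""
    return [stoi.get(token, stoi[UNK]) for token in tokens]


# Toy data standing in for the real datasets (hypothetical).
train_examples = [["the", "cat", "sat"], ["the", "dog", "ran"]]
test_examples = [["the", "bird", "sat"]]

# Build the vocab from the training data only.
stoi = build_vocab(count_tokens(train_examples))

# "bird" never appears in training, so it becomes the <unk> index (0).
test_ids = [numericalize(tokens, stoi) for tokens in test_examples]
```

Because count_tokens accepts an existing Counter, the counts can also be accumulated one chunk of data at a time, so the full dataset never has to sit in memory while the vocab is being built.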