Combine train and test set (torchtext) but ConcatDataset object has no attribute "get_vocab"

EmreTokyuez · August 7, 2020, 11:59pm

I want to use the examples in the test set of the IMDB Sentiment Analysis Dataset for training, as I have built my own benchmark with which I will compare the performance of various models (for my Matura thesis).

After some trying, I got the appending working and also managed to split the data so that I have a validation set as well. The code is the following:

import torch
from torchtext.experimental.datasets import IMDB
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("spacy")

train_data, test_data = IMDB(tokenizer=tokenizer)
train_data, valid_data = torch.utils.data.random_split(train_data, [17500, 7500])

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000

Now I want to merge the train and the test set and leave the validation set alone, so I do this:

from torch.utils.data import ConcatDataset

train_data = ConcatDataset([train_data, test_data])

print(f'Number of training examples: {len(train_data)}')

Number of training examples: 42500

But when I try to get the vocabulary, I get the following error:

vocab = train_dataset.get_vocab()

AttributeError                            Traceback (most recent call last)
----> 1 vocab = train_dataset.get_vocab()

AttributeError: 'ConcatDataset' object has no attribute 'get_vocab'

How can I solve this?

ptrblck · August 10, 2020, 9:11am

random_split will return Subsets, which wrap the dataset.
To access the underlying dataset, you could use train_dataset.dataset.get_vocab().
However, this would of course call the get_vocab() method on the complete dataset.
I'm not familiar with this method, but if it creates the vocabulary from the dataset it is called on, the result would contain the words from both the training and validation datasets.
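
As a rough sketch of how the wrappers can be unwrapped: a Subset stores the wrapped dataset in its .dataset attribute, while a ConcatDataset stores its parts in the list .datasets, so neither forwards get_vocab() itself. ToyTextDataset and shared_vocab below are hypothetical stand-ins for the torchtext IMDB dataset and its vocabulary (assumptions made only so the snippet runs without downloading anything); the real dataset may behave differently.

```python
from torch.utils.data import ConcatDataset, Dataset, random_split


class ToyTextDataset(Dataset):
    """Hypothetical stand-in for a torchtext dataset exposing get_vocab()."""

    def __init__(self, samples, vocab):
        self.samples = samples
        self.vocab = vocab

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

    def get_vocab(self):
        return self.vocab


# Made-up data: both toy datasets share one vocabulary object.
shared_vocab = {"<unk>": 0, "good": 1, "bad": 2}
full_train = ToyTextDataset([[1], [2], [1], [2]], shared_vocab)
test_data = ToyTextDataset([[2], [1]], shared_vocab)

# random_split returns Subset objects that wrap full_train.
train_subset, valid_subset = random_split(full_train, [3, 1])

# ConcatDataset wraps the Subset and the test set.
merged = ConcatDataset([train_subset, test_data])

# Subset: the wrapped dataset is available as `.dataset`.
vocab_from_subset = train_subset.dataset.get_vocab()

# ConcatDataset: the parts are available as the list `.datasets`.
first_part = merged.datasets[0]                     # the training Subset
vocab_from_concat = first_part.dataset.get_vocab()  # unwrap the Subset too
vocab_from_test = merged.datasets[1].get_vocab()    # plain dataset, no unwrap

print(len(merged), vocab_from_concat)
```

Note that the vocabulary reached this way belongs to the complete underlying dataset, so in this sketch it also covers the examples that ended up in valid_subset.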