I want to use the examples in the test set of the IMDB Sentiment Analysis Dataset for training, because I have built my own benchmark on which I will compare the performance of various models (for my Matura thesis).
After some trial and error, I got the appending to work and also managed to split the data so that I have a validation set as well. The code is the following:
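import torch
from torchtext.experimental.datasets import IMDB
from torchtext.data.utils import get_tokenizer
# Tokenize with spaCy and load the IMDB train/test splits
tokenizer = get_tokenizer("spacy")
train_data, test_data = IMDB(tokenizer=tokenizer)
# Hold out 7,500 of the 25,000 training examples for validation
train_data, valid_data = torch.utils.data.random_split(train_data, [17500, 7500])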
Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
Now I want to merge the train and the test set and leave the validation set alone, so I do this:
from torch.utils.data import ConcatDataset
train_data = ConcatDataset([train_data, test_data])
print(f'Number of training examples: {len(train_data)}')
Number of training examples: 42500
But when I try to get the vocabulary, I get the following error:
vocab = train_dataset.get_vocab()
AttributeError Traceback (most recent call last)
----> 1 vocab = train_dataset.get_vocab()
AttributeError: 'ConcatDataset' object has no attribute 'get_vocab'
How can I solve this?
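To make the problem concrete, here is a minimal sketch of what I think is going on, using small stand-in classes. `ToyTextDataset` and `ToyConcat` are hypothetical names I made up for illustration; they are not the real torchtext or torch implementations. The assumption is that the concatenating wrapper only offers `len()` and indexing, so any extra method such as `get_vocab()` has to be taken from the underlying dataset.

```python
# Stand-in classes (hypothetical, NOT the real torch/torchtext code) that
# mimic the relevant behaviour: the wrapper exposes only len() and indexing.

class ToyTextDataset:
    """Plays the role of the torchtext dataset: holds examples and a vocab."""

    def __init__(self, examples, vocab):
        self.examples = examples
        self._vocab = vocab

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

    def get_vocab(self):
        return self._vocab


class ToyConcat:
    """Plays the role of ConcatDataset: chains datasets, forwards nothing else."""

    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __len__(self):
        return sum(len(d) for d in self.datasets)

    def __getitem__(self, idx):
        # Walk through the wrapped datasets until the index falls inside one
        for d in self.datasets:
            if idx < len(d):
                return d[idx]
            idx -= len(d)
        raise IndexError("index out of range")


shared_vocab = {"<unk>": 0, "good": 1, "bad": 2}
train = ToyTextDataset([("pos", [1]), ("neg", [2])], shared_vocab)
test = ToyTextDataset([("pos", [1, 1])], shared_vocab)

vocab = train.get_vocab()            # grab the vocab BEFORE wrapping
merged = ToyConcat([train, test])

print(len(merged))                   # 3
print(hasattr(merged, "get_vocab"))  # False
print(merged.datasets[0].get_vocab() is vocab)  # True
```

If the real classes behave the same way, then calling `get_vocab()` on the original IMDB training set before `random_split` and `ConcatDataset` should avoid the error. Reaching the original through `train_data.datasets[0].dataset` might also work, since `ConcatDataset` keeps a `datasets` list and the `Subset` returned by `random_split` keeps a `dataset` attribute, but I have not verified this against the experimental torchtext API.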