IMDB split datasets

Niki · October 7, 2019, 1:42am

Hi,

In Torchtext,

train, test = datasets.IMDB.splits(TEXT, LABEL)

splits the data 50:50 between train and test. Is there any way to change this ratio to 80:20?

ptrblck · October 7, 2019, 3:18am

The ratio for splitting the IMDB dataset originates from the data itself, as 25,000 reviews are provided for training and 25,000 for testing.
From the dataset website:

We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.

Niki · October 7, 2019, 2:23pm

Understood. Thank you, @ptrblck.

Rohan_Kumar · October 7, 2019, 8:41pm

Personally, I don't like torchtext all that much. I would recommend using a custom data loader and dividing the data as you want. I was working on this very dataset; I will upload the code to my GitHub by Sunday and link it once it's up, so you can take a look if you want.
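
A minimal sketch of the "divide the data as you want" idea (this is not the code mentioned above). It assumes the 25,000 training and 25,000 test reviews have already been loaded into plain Python lists of (text, label) pairs; the helper name `resplit` and the toy data are made up for illustration. It merges both official splits, shuffles them with a fixed seed, and cuts at 80%.

```python
import random


def resplit(train_samples, test_samples, train_fraction=0.8, seed=0):
    """Merge the two provided splits and re-split them at train_fraction.

    `resplit` is a hypothetical helper, not part of torchtext or PyTorch.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    combined = list(train_samples) + list(test_samples)
    # A seeded generator keeps the split reproducible across runs.
    rng = random.Random(seed)
    rng.shuffle(combined)
    cut = round(len(combined) * train_fraction)
    return combined[:cut], combined[cut:]


# Toy stand-in for the real reviews: 50 "train" and 50 "test" samples.
official_train = [(f"train review {i}", i % 2) for i in range(50)]
official_test = [(f"test review {i}", i % 2) for i in range(50)]

new_train, new_test = resplit(official_train, official_test)
print(len(new_train), len(new_test))  # 80 20
```

With the full 50,000 reviews the same call would give 40,000 training and 10,000 test samples. Keep in mind that scores measured on a custom split are not directly comparable to results reported on the official 25,000/25,000 split.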

Niki · October 8, 2019, 1:40am

That will help a lot, thank you very much @Rohan_Kumar.

Niki · October 9, 2019, 12:37pm

May I ask about the test accuracy on the SST dataset with an LSTM, using train_data, valid_data, test_data = datasets.SST.splits(TEXT, LABEL)? Is it around 60%?