IMDB split datasets


In Torchtext,

train, test = datasets.IMDB.splits(TEXT, LABEL)

splits the data 50:50 between train and test. Is there any way to change this ratio to 80:20?

The ratio for splitting the IMDB dataset originates from the data itself, as 25,000 reviews are provided for training and 25,000 for testing.
From the dataset website:

We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
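If an 80:20 ratio is still wanted, one option is to pool the two provided halves and cut them again yourself. The snippet below is only a minimal sketch in plain Python, not a torchtext API: `resplit`, `train_fraction` and `seed` are hypothetical names, and the examples are assumed to be simple `(text, label)` tuples. Note that moving test reviews into the training set means the results are no longer comparable with numbers reported on the official split.

```python
import random


def resplit(train_examples, test_examples, train_fraction=0.8, seed=0):
    """Pool two lists of examples and split them again by train_fraction.

    Hypothetical helper: works on any sequences of examples,
    e.g. (text, label) tuples.
    """
    pooled = list(train_examples) + list(test_examples)
    # A seeded RNG keeps the shuffle (and therefore the split) reproducible.
    rng = random.Random(seed)
    rng.shuffle(pooled)
    cut = int(round(len(pooled) * train_fraction))
    return pooled[:cut], pooled[cut:]


if __name__ == "__main__":
    # Toy stand-in for the 25,000 + 25,000 reviews.
    old_train = [("train review %d" % i, "pos") for i in range(50)]
    old_test = [("test review %d" % i, "neg") for i in range(50)]
    new_train, new_test = resplit(old_train, old_test)
    print(len(new_train), len(new_test))
```

The same idea applies to the full dataset: pooling 50,000 reviews with `train_fraction=0.8` leaves 40,000 for training and 10,000 for testing.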


Understood. Thank you, @ptrblck.

Personally, I don't like torchtext all that much. I would recommend using a custom dataloader and dividing the data as you want. I was working on this very dataset; I will upload the code to my GitHub by Sunday and link it here once it's up, so you can take a look if you want.
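As a rough illustration of the custom-loading route (this is not the code mentioned above), the reviews can be read straight from the extracted archive. The sketch assumes the usual layout of the downloaded dataset, with `train/` and `test/` folders that each contain `pos/` and `neg/` folders of `.txt` files; `load_reviews` and `root` are hypothetical names.

```python
from pathlib import Path


def load_reviews(root):
    """Read every labelled review under root into (text, label) tuples.

    Assumes root/{train,test}/{pos,neg}/*.txt; other folders are ignored.
    """
    examples = []
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = Path(root) / split / label
            # sorted() makes the order deterministic across runs.
            for path in sorted(folder.glob("*.txt")):
                examples.append((path.read_text(encoding="utf-8"), label))
    return examples
```

The resulting list holds all labelled reviews in one place, so it can be shuffled and divided in whatever ratio is needed before being wrapped in a dataset class.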


That will help a lot, thank you very much @Rohan_Kumar.

May I also ask about test accuracy on the SST dataset, loaded with train_data, valid_data, test_data = datasets.SST.splits(TEXT, LABEL) and trained with an LSTM? Is a test accuracy of around 60% expected?