Stratified split of data set


(Hrishikesh Menon) #1

Hi,

I know that most people prefer to create separate data sets for training and testing. However, can we perform a stratified split on a data set? By ‘stratified split’, I mean that for a 70:30 split, each class is split 70:30 individually; the 70% portions of all classes are then merged to form data set 1 and the 30% portions to form data set 2. While I have seen random splits (like kevinzakka’s script), I have not seen an example of a stratified split yet.
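For reference, the per-class splitting described above can be sketched in plain Python. This is a minimal sketch, not an official utility; the 70:30 ratio and the two-class label list are just example values:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices so each class is divided in the given ratio.

    Returns two lists of dataset indices: one for the training
    portion and one for the test portion.
    """
    rng = random.Random(seed)
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split each class separately, then merge the pieces.
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(train_frac * len(indices)))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

# Example: 10 samples of class 0 and 10 of class 1.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 14 6
```

With a torchvision dataset you can pass the resulting index lists to `torch.utils.data.Subset`. Alternatively, scikit-learn’s `train_test_split` accepts a `stratify=labels` argument that performs the same kind of split.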

Continuing this, is there a way to access all elements of a single class once the data has been dumped to a data set?

Thank you,
Richukuttan


(Solomon K ) #2

Look here:


Section named “train validation split”


(Hrishikesh Menon) #3

Please explain how this ensures a proportional split of the examples of each class. From what I understand, the only way that is possible is by carefully arranging the entries in the CSV file. If I randomize the entries, the output also becomes randomized, so it is entirely possible that a particular class gets no training examples because all of its examples lie after the split point. If it really depends on the CSV ordering, it may be much easier to just create two folders.

Also, can you explain how to access all elements of a given class after the data has been loaded into a dataset (preferably with ImageFolder)?


#4

This depends on the dataset; look at the source code to figure it out: https://github.com/pytorch/vision/blob/master/torchvision/datasets/folder.py
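For ImageFolder specifically, the source linked above shows that the dataset exposes `samples` (a list of `(path, class_index)` pairs), `targets` (the class index of each sample), and `class_to_idx` (a dict mapping folder name to class index). Filtering indices by class can be sketched like this; the file paths and class names below are made-up stand-ins for a real dataset:

```python
# Stand-in for what ImageFolder builds from a folder tree:
# dataset.samples, dataset.targets, dataset.class_to_idx.
samples = [("cat/1.png", 0), ("dog/1.png", 1), ("cat/2.png", 0)]
class_to_idx = {"cat": 0, "dog": 1}

def indices_of_class(targets, class_index):
    """Return the dataset indices whose label equals class_index."""
    return [i for i, t in enumerate(targets) if t == class_index]

targets = [label for _, label in samples]
cat_indices = indices_of_class(targets, class_to_idx["cat"])
print(cat_indices)  # [0, 2]
```

With a real ImageFolder instance, the same index list can then be wrapped with `torch.utils.data.Subset(dataset, cat_indices)` to get a dataset containing only that class.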