Splitting unbalanced classes of images into training, validation and testing sets

Eric_Gill · March 27, 2021, 12:33pm

Hi there,

I have 3 classes of images, Covid, Normal and Viral Pneumonia (VP), in three different folders. The folders contain roughly 3,600, 12,000 and 1,600 images respectively. As this set is imbalanced, I would like to split it into training, validation and testing subsets while maintaining the proportion of each class in each of the subsets.

I have been doing some research and tried this answer. I loaded all the images using ImageFolder, got an array of the class labels (represented as 0, 1 and 2) and retrieved the train and test indices with a stratified split, so that both keep the original class proportions:

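import numpy as np
from sklearn.model_selection import train_test_split
from torchvision import datasets

img_dataset = datasets.ImageFolder(
    root=first_model_im_path,
    transform=image_transforms["train"],
)

# Class labels (0, 1, 2), one per image; assumed here to come from ImageFolder's targets
class_array = np.array(img_dataset.targets)

# Stratified 80/20 split of the sample indices
train_idx, test_idx = train_test_split(
    np.arange(len(class_array)),
    test_size=0.2,
    shuffle=True,
    stratify=class_array,
)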

So far so good, but I have hit a wall at this point. The next step would of course be taking that train_idx set of indices and splitting it further into training and validation indices, but I also have to guarantee that both keep the same class proportions. Does anyone have any idea how to proceed? Or indeed, is there a better way to achieve what I am attempting?

I appreciate the help, thanks so much in advance,
Eric

ptrblck · March 29, 2021, 8:17am

You could call train_test_split again, this time passing train_idx as the array to split (instead of the np.arange input) and stratify=class_array[train_idx], to create the train and validation indices. Let me know if this works.
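
A minimal sketch of that two-step split, not tested against the original dataset. It assumes synthetic labels whose counts follow the rough folder sizes from the question, a 60/20/20 split, and a fixed random_state; the name train_val_idx and the helper class_fractions are hypothetical and only used for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for class_array: 0 = Covid, 1 = Normal, 2 = Viral Pneumonia.
# The counts are the rough folder sizes quoted in the question.
class_array = np.repeat([0, 1, 2], [3600, 12000, 1600])

# Step 1: hold out 20% of all indices as the test set, stratified by class.
train_val_idx, test_idx = train_test_split(
    np.arange(len(class_array)),
    test_size=0.2,
    shuffle=True,
    stratify=class_array,
    random_state=0,  # assumption: fixed seed so the split is reproducible
)

# Step 2: split the remaining indices again, stratifying on their labels only.
# 0.25 of the remaining 80% is about 20% of the full dataset (assumed ratio).
train_idx, val_idx = train_test_split(
    train_val_idx,
    test_size=0.25,
    shuffle=True,
    stratify=class_array[train_val_idx],
    random_state=0,
)


def class_fractions(idx):
    """Return the share of each class among the samples selected by idx."""
    counts = np.bincount(class_array[idx], minlength=3)
    return counts / counts.sum()


# Print the size and class proportions of each subset for a quick sanity check.
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), np.round(class_fractions(idx), 3))
```

With the real dataset, the resulting index arrays could then be wrapped with torch.utils.data.Subset(img_dataset, train_idx) (and likewise for val_idx and test_idx) before building the DataLoaders.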

Eric_Gill · March 30, 2021, 4:51pm

Thanks so much for your help!