I have three classes of images, Covid, Normal and Viral Pneumonia (VP), in three separate folders, with roughly 3,600, 12,000 and 1,600 images respectively. Since the dataset is imbalanced, I would like to split it into training, validation and testing subsets while preserving the class proportions in each subset.
I have been doing some research and tried this answer. I loaded all the images using ImageFolder, got an array of the class labels (represented as 0, 1 and 2) and retrieved stratified train and test indices:
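    import numpy as np
    from sklearn.model_selection import train_test_split
    from torchvision import datasets

    img_dataset = datasets.ImageFolder(
        root=first_model_im_path,
        transform=image_transforms["train"],
    )

    # class_array is the array of class labels (0, 1, 2) described above,
    # e.g. obtained with np.array(img_dataset.targets)
    train_idx, test_idx = train_test_split(
        np.arange(len(class_array)),
        test_size=0.2,
        shuffle=True,
        stratify=class_array,
    )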
So far so good, but I have hit a wall at this point. The next step would presumably be to take train_idx and split it further into training and validation indices, but that second split also has to preserve the class proportions. Does anyone have any idea how to proceed? Or is there a better way to achieve what I am attempting?
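For reference, here is a minimal sketch of the kind of two-stage split I have in mind: call train_test_split a second time on the remaining indices, stratifying on the labels restricted to those indices. The label array below is a synthetic stand-in with roughly my class sizes, and the names train_val_idx, val_idx and class_fractions, the 60/20/20 ratio and random_state=0 are all assumptions for illustration, not something I have settled on.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with roughly the class sizes from the question
# (0 = Covid, 1 = Normal, 2 = VP). In the real code this would be class_array.
class_array = np.repeat([0, 1, 2], [3600, 12000, 1600])

# First split: 80% train+validation, 20% test, stratified on all labels.
train_val_idx, test_idx = train_test_split(
    np.arange(len(class_array)),
    test_size=0.2,
    shuffle=True,
    stratify=class_array,
    random_state=0,  # assumed seed, only for reproducibility
)

# Second split: stratify on the labels of the remaining indices only.
# test_size=0.25 of the remaining 80% is about 20% of the full set.
train_idx, val_idx = train_test_split(
    train_val_idx,
    test_size=0.25,
    shuffle=True,
    stratify=class_array[train_val_idx],
    random_state=0,
)


def class_fractions(idx):
    """Return the fraction of each class among the given indices."""
    counts = np.bincount(class_array[idx], minlength=3)
    return counts / counts.sum()


# Print the class proportions of each subset for a quick visual check.
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(name, len(idx), class_fractions(idx).round(3))
```

If this is sound, each index array could then presumably be wrapped with torch.utils.data.Subset(img_dataset, idx) to build the three datasets.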
I appreciate the help, thanks so much in advance,