My dataset is structured so that every sub-directory is a class (and all images inside a sub-directory are labeled with that class). I want to make the split described in the title, and I also want to make 10 random partitions using the same split.
To accomplish this, I shuffled the filenames in every class sub-directory, took 80 for the training subset, took 20 for the validation subset, and put the rest in the evaluation subset. So I ended up with 10x3 sets of filenames. With those, I made a custom Dataset that uses the filenames. It looks as follows:
import os

from PIL import Image
from torch.utils.data import Dataset


class ImageFilenamesDataset(Dataset):  # class name added here for completeness
    def __init__(self, root_dir, filenames, label2id, transform=None):
        self.root_dir = root_dir
        self.filenames = filenames  # paths relative to root_dir, e.g. "dog/img_001.jpg"
        self.label2id = label2id
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        # The class name is the sub-directory component of the relative path
        classname = os.path.dirname(filename)
        image_path = os.path.join(self.root_dir, filename)
        with open(image_path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.label2id[classname], filename
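For completeness, the per-class shuffle-and-slice that produces the 10x3 filename sets can be sketched like this (the function name, seed, and synthetic file lists are illustrative, not my actual code):

```python
import random

def make_partitions(files_by_class, n_train=80, n_val=20, n_partitions=10, seed=0):
    """files_by_class maps class name -> list of paths relative to root_dir."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_partitions):
        train, val, evaluation = [], [], []
        for files in files_by_class.values():
            shuffled = list(files)
            rng.shuffle(shuffled)                      # fresh shuffle per partition
            train += shuffled[:n_train]                # first 80 -> training
            val += shuffled[n_train:n_train + n_val]   # next 20 -> validation
            evaluation += shuffled[n_train + n_val:]   # rest -> evaluation
        partitions.append((train, val, evaluation))
    return partitions
```

Note that each filename keeps its class sub-directory prefix, which is what the `os.path.dirname` call in `__getitem__` relies on.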
Previously, when I was doing a plain 60/20/20 split on the whole dataset, I just used ImageFolder, and that was about 4x faster per epoch than what I have here.
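For context on where parallel loading actually happens: the Dataset loads one image per `__getitem__` call, and it is the DataLoader's worker processes that run those calls in parallel. A minimal sketch with a toy stand-in dataset (names, sizes, and parameter values are illustrative, not from my setup):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Stand-in with the same (image, label, filename) return shape as above."""
    def __len__(self):
        return 16

    def __getitem__(self, index):
        image = torch.zeros(3, 8, 8)  # placeholder for a transformed image tensor
        return image, index % 2, f"item_{index}.jpg"

# num_workers > 0 is what parallelizes loading across processes; 0 keeps this
# sketch single-process.
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True, num_workers=0)
```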
So my question is: How should I optimize my Dataset? Is my idea of splitting up the filenames not a good idea to begin with?