My dataset is structured such that every sub-directory is a class (all images in a sub-directory are labeled with that class). I want to make the split described in the title, and I also want to make 10 random partitions using the same split.
To accomplish this, I shuffled the filenames in every class sub-directory, took the first 80 for the training subset, the next 20 for the validation subset, and put the rest in the evaluation subset. Repeating this 10 times left me with 10x3 lists of filenames; a sketch of that splitting step is below.
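In outline, the splitting step looks roughly like this (a minimal sketch; the `make_partitions` name and the seed handling are just illustrative):

```python
import os
import random

def make_partitions(root_dir, n_partitions=10, n_train=80, n_val=20, seed=0):
    """Per class: shuffle, take 80 for train, 20 for val, rest for eval."""
    rng = random.Random(seed)
    classes = [d for d in os.listdir(root_dir)
               if os.path.isdir(os.path.join(root_dir, d))]
    partitions = []
    for _ in range(n_partitions):
        train, val, evaluation = [], [], []
        for cls in classes:
            # Filenames are kept relative to root_dir, e.g. 'cat/img001.jpg'
            files = [os.path.join(cls, f)
                     for f in os.listdir(os.path.join(root_dir, cls))]
            rng.shuffle(files)
            train += files[:n_train]
            val += files[n_train:n_train + n_val]
            evaluation += files[n_train + n_val:]
        partitions.append((train, val, evaluation))
    return partitions
```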
With those filename lists, I made a custom Dataset that looks as follows:

```python
import os

from PIL import Image
from torch.utils.data import Dataset

class FilenameDataset(Dataset):
    def __init__(self, root_dir, filenames, label2id, transform=None):
        self.root_dir = root_dir
        self.filenames = filenames
        self.label2id = label2id
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        # The class name is the sub-directory the file lives in
        classname = os.path.dirname(filename)
        image_path = os.path.join(self.root_dir, filename)
        # Load the image and decode it to RGB
        with open(image_path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.label2id[classname], filename
```
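For completeness, this is roughly how I wire one of the 10 partitions into a DataLoader (hypothetical glue code; `make_partitions`, `FilenameDataset`, the path, and the transform are the illustrative names and values from above):

```python
import os

from torch.utils.data import DataLoader
from torchvision import transforms

root_dir = 'path/to/dataset'  # assumed layout: root_dir/<class>/<image>
partitions = make_partitions(root_dir)

# Map each class sub-directory to an integer id
classes = sorted({os.path.dirname(f) for f in partitions[0][0]})
label2id = {cls: i for i, cls in enumerate(classes)}

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # illustrative preprocessing
    transforms.ToTensor(),
])

train_files, val_files, eval_files = partitions[0]
train_set = FilenameDataset(root_dir, train_files, label2id, transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)
```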
Previously, when I was just doing a 60/20/20 split on the whole dataset, I used ImageFolder (roughly the setup sketched below), and that was about 4x faster per epoch than what I have now.
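The earlier version was essentially this (a sketch; the exact code may have differed, with `random_split` being one way to express the 60/20/20 split):

```python
from torch.utils.data import random_split
from torchvision.datasets import ImageFolder

full_set = ImageFolder('path/to/dataset', transform=transform)
n = len(full_set)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_set, val_set, eval_set = random_split(
    full_set, [n_train, n_val, n - n_train - n_val])
```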
So my question is: how should I optimize my Dataset? And is splitting up the filenames like this a bad idea to begin with?