My dataset is structured so that every sub-directory is a class (and all images inside a sub-directory are labeled with that class). I want to make the split described in the title, and I also want to make 10 random partitions using the same split.
To accomplish this, I shuffled the filenames in every class sub-directory, took 80 for the training subset, took 20 for the validation subset, and put the rest in the evaluation subset. So I ended up with 10x3 sets of filenames. With those, I made a custom Dataset that uses the filenames. It looks as follows:
import os

from PIL import Image
from torch.utils.data import Dataset


class ImageFilenamesDataset(Dataset):  # class name added here for completeness
    def __init__(self, root_dir, filenames, label2id, transform=None):
        self.root_dir = root_dir
        self.filenames = filenames  # paths relative to root_dir, e.g. "dog/img_001.jpg"
        self.label2id = label2id
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        # The class name is the sub-directory component of the relative path
        classname = os.path.dirname(filename)
        image_path = os.path.join(self.root_dir, filename)
        with open(image_path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, self.label2id[classname], filename
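For completeness, the per-class shuffle-and-slice that produces the 10x3 filename sets can be sketched like this (the function name, seed, and synthetic file lists are illustrative, not my actual code):

```python
import random

def make_partitions(files_by_class, n_train=80, n_val=20, n_partitions=10, seed=0):
    """files_by_class maps class name -> list of paths relative to root_dir."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_partitions):
        train, val, evaluation = [], [], []
        for files in files_by_class.values():
            shuffled = list(files)
            rng.shuffle(shuffled)                      # fresh shuffle per partition
            train += shuffled[:n_train]                # first 80 -> training
            val += shuffled[n_train:n_train + n_val]   # next 20 -> validation
            evaluation += shuffled[n_train + n_val:]   # rest -> evaluation
        partitions.append((train, val, evaluation))
    return partitions
```

Note that each filename keeps its class sub-directory prefix, which is what the `os.path.dirname` call in `__getitem__` relies on.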
Previously, when I was doing a plain 60/20/20 split on the whole dataset, I just used ImageFolder, and that was about 4x faster per epoch than what I have here.
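For context on where parallel loading actually happens: the Dataset loads one image per `__getitem__` call, and it is the DataLoader's worker processes that run those calls in parallel. A minimal sketch with a toy stand-in dataset (names, sizes, and parameter values are illustrative, not from my setup):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Stand-in with the same (image, label, filename) return shape as above."""
    def __len__(self):
        return 16

    def __getitem__(self, index):
        image = torch.zeros(3, 8, 8)  # placeholder for a transformed image tensor
        return image, index % 2, f"item_{index}.jpg"

# num_workers > 0 is what parallelizes loading across processes; 0 keeps this
# sketch single-process.
loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True, num_workers=0)
```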
So my question is: How should I optimize my Dataset? Is my idea of splitting up the filenames not a good idea to begin with?