Custom Dataset: Best Practices for Transformations on Training Set

Hello,

I am working on an image segmentation task. There are two folders, one for inputs and one for targets, which contain the corresponding pairs of the dataset (i.e., the input file ipt_xy.npy corresponds to the target file tgt_xy.npy).

My dataset class looks like this:

import os

import numpy as np
from torch.utils.data import Dataset


class DatasetBlaBla(Dataset):
    def __init__(self, root_path, ipt, tgt, transform=None):
        super().__init__()
        self.root_path = root_path
        self.ipt = ipt
        self.tgt = tgt
        self.transform = transform

    def __len__(self):
        number_files_inp = len(os.listdir(os.path.join(self.root_path, self.ipt)))
        number_files_tgt = len(os.listdir(os.path.join(self.root_path, self.tgt)))

        if number_files_inp != number_files_tgt:
            # Fail loudly instead of implicitly returning None.
            raise RuntimeError("input and target folders contain a different number of files")
        return number_files_inp

    def __getitem__(self, idx):
        img_path_input_patch = os.path.join(self.root_path, self.ipt, f"ipt_{idx}.npy")
        img_path_tgt_patch = os.path.join(self.root_path, self.tgt, f"tgt_{idx}.npy")

        input_patch = np.load(img_path_input_patch)
        tgt_patch = np.load(img_path_tgt_patch)

        if self.transform:
            # Note: calling a random transform twice draws different random
            # parameters for the input and the target, which can desynchronize them.
            input_patch = self.transform(input_patch)
            tgt_patch = self.transform(tgt_patch)

        return input_patch, tgt_patch
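As a side note on the transform call above: if the transform contains random operations, applying it separately to the input and the target draws different random parameters for each, so the pair can fall out of alignment. A common workaround is to sample the random decision once and apply it to both arrays; here is a minimal, dependency-free sketch with a random horizontal flip (the helper name is illustrative, not part of the original code):

```python
import random

import numpy as np


def paired_random_hflip(input_patch, tgt_patch, p=0.5):
    # Sample the random decision ONCE so input and target stay aligned.
    if random.random() < p:
        input_patch = np.flip(input_patch, axis=-1).copy()
        tgt_patch = np.flip(tgt_patch, axis=-1).copy()
    return input_patch, tgt_patch
```

With torchvision, the same idea is usually implemented by sampling parameters via `transforms.RandomCrop.get_params` (and similar) and applying them to both tensors through `torchvision.transforms.functional`.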

When creating the dataset, one instance is created, which I then split into train/val/test using:

train_set, val_set, test_set = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])
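For completeness, the three sizes passed to random_split have to sum to len(dataset); a minimal sketch of how I compute them (the 80/10/10 ratios and the toy dataset are just placeholders), with a seeded generator so the split is reproducible:

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-in dataset just to illustrate the split.
dataset = TensorDataset(torch.arange(100).float())

train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size  # remainder avoids rounding gaps

train_set, val_set, test_set = torch.utils.data.random_split(
    dataset,
    [train_size, val_size, test_size],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)
```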

Finally, we come to the question:

What are best practices, in this case, to apply transformations on the train_set only?
I have looked through the forum and found a variety of approaches; however, I still wonder whether there is a distinct best practice for my use case. Also, I do not want to split the folders into train/val/test subfolders, because reshuffling would be an issue with my amount of data.

Thank you for your help,

Cheers

My preference would be to create three different datasets using the desired transformations for the training, validation, and test sets. This approach makes it explicit that the train_dataset uses the train_transform only. Once this is done, create the training, validation, and test indices via any kind of splitting (sklearn.model_selection.train_test_split is quite popular) and wrap each dataset in a Subset with the corresponding indices.
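A self-contained sketch of this approach with a toy stand-in dataset (all names and the lambda transform are illustrative); with your real data you would create three DatasetBlaBla instances instead and could use sklearn's train_test_split in place of the simple permutation shown here:

```python
import torch
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Minimal stand-in for the DatasetBlaBla class from the question."""
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample


data = torch.arange(10).float()  # hypothetical toy data

# Three instances over the SAME underlying data, differing only in transform.
train_dataset = ToyDataset(data, transform=lambda x: x * 2)  # train-only augmentation
val_dataset = ToyDataset(data, transform=None)
test_dataset = ToyDataset(data, transform=None)

# Create the index split once, then reuse it for all three datasets.
indices = torch.randperm(len(data), generator=torch.Generator().manual_seed(0)).tolist()
train_idx, val_idx, test_idx = indices[:6], indices[6:8], indices[8:]

# Wrap each dataset in a Subset with the corresponding indices.
train_set = Subset(train_dataset, train_idx)
val_set = Subset(val_dataset, val_idx)
test_set = Subset(test_dataset, test_idx)
```

Since all three datasets index the same files, the subsets are disjoint views of the data, and only the training subset sees the augmented transform.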

There are certainly other approaches, so let’s also hear from other users. :slight_smile:

@ptrblck thank you for your answer!