Custom Dataset: Best Practices for Transformations on Training Set

Hello,

I am working on an image segmentation task. There are two folders, one for inputs and one for targets, which contain the corresponding pairs of the dataset (i.e., the input file ipt_xy.npy corresponds to the target file tgt_xy.npy).

My dataset class looks like this:

import os

import numpy as np
from torch.utils.data import Dataset


class DatasetBlaBla(Dataset):
    def __init__(self, root_path, ipt, tgt, transform=None):
        super().__init__()
        self.root_path = root_path
        self.ipt = ipt
        self.tgt = tgt
        self.transform = transform

    def __len__(self):
        number_files_inp = len(os.listdir(os.path.join(self.root_path, self.ipt)))
        number_files_tgt = len(os.listdir(os.path.join(self.root_path, self.tgt)))

        if number_files_inp != number_files_tgt:
            # Fail loudly instead of implicitly returning None.
            raise RuntimeError("input and target folders contain a different number of files")
        return number_files_inp

    def __getitem__(self, idx):
        img_path_input_patch = os.path.join(self.root_path, self.ipt, f"ipt_{idx}.npy")
        img_path_tgt_patch = os.path.join(self.root_path, self.tgt, f"tgt_{idx}.npy")

        input_patch = np.load(img_path_input_patch)
        tgt_patch = np.load(img_path_tgt_patch)

        if self.transform:
            # Note: calling a random transform twice draws different random
            # parameters for the input and the target, which can desynchronize them.
            input_patch = self.transform(input_patch)
            tgt_patch = self.transform(tgt_patch)

        return input_patch, tgt_patch
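As a side note on the transform call above: if the transform contains random operations, applying it separately to the input and the target draws different random parameters for each, so the pair can fall out of alignment. A common workaround is to sample the random decision once and apply it to both arrays; here is a minimal, dependency-free sketch with a random horizontal flip (the helper name is illustrative, not part of the original code):

```python
import random

import numpy as np


def paired_random_hflip(input_patch, tgt_patch, p=0.5):
    # Sample the random decision ONCE so input and target stay aligned.
    if random.random() < p:
        input_patch = np.flip(input_patch, axis=-1).copy()
        tgt_patch = np.flip(tgt_patch, axis=-1).copy()
    return input_patch, tgt_patch
```

With torchvision, the same idea is usually implemented by sampling parameters via `transforms.RandomCrop.get_params` (and similar) and applying them to both tensors through `torchvision.transforms.functional`.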

When creating the dataset, one instance is created, which I then split into train/val/test using:

train_set, val_set, test_set = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])
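For completeness, the three sizes passed to random_split have to sum to len(dataset); a minimal sketch of how I compute them (the 80/10/10 ratios and the toy dataset are just placeholders), with a seeded generator so the split is reproducible:

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-in dataset just to illustrate the split.
dataset = TensorDataset(torch.arange(100).float())

train_size = int(0.8 * len(dataset))
val_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - val_size  # remainder avoids rounding gaps

train_set, val_set, test_set = torch.utils.data.random_split(
    dataset,
    [train_size, val_size, test_size],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)
```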

Finally, we come to the question:

What are best practices, in this case, to apply transformations on the train_set only?
I have looked through the forum and found a variety of approaches; however, I still wonder whether there is a distinct best practice for my use case. Also, I do not want to split the folders into train/val/test subfolders, because reshuffling would be an issue with my amount of data.

Thank you for your help,

Cheers

My preference would be to create three different datasets using the desired transformations for the training, validation, and test sets. This approach makes it explicit that the train_dataset uses the train_transform only. Once this is done, create the training, validation, and test indices via any kind of splitting (sklearn.model_selection.train_test_split is quite popular) and wrap each dataset in a Subset with the corresponding indices.
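A self-contained sketch of this approach with a toy stand-in dataset (all names and the lambda transform are illustrative); with your real data you would create three DatasetBlaBla instances instead and could use sklearn's train_test_split in place of the simple permutation shown here:

```python
import torch
from torch.utils.data import Dataset, Subset


class ToyDataset(Dataset):
    """Minimal stand-in for the DatasetBlaBla class from the question."""
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample


data = torch.arange(10).float()  # hypothetical toy data

# Three instances over the SAME underlying data, differing only in transform.
train_dataset = ToyDataset(data, transform=lambda x: x * 2)  # train-only augmentation
val_dataset = ToyDataset(data, transform=None)
test_dataset = ToyDataset(data, transform=None)

# Create the index split once, then reuse it for all three datasets.
indices = torch.randperm(len(data), generator=torch.Generator().manual_seed(0)).tolist()
train_idx, val_idx, test_idx = indices[:6], indices[6:8], indices[8:]

# Wrap each dataset in a Subset with the corresponding indices.
train_set = Subset(train_dataset, train_idx)
val_set = Subset(val_dataset, val_idx)
test_set = Subset(test_dataset, test_idx)
```

Since all three datasets index the same files, the subsets are disjoint views of the data, and only the training subset sees the augmented transform.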

There are certainly other approaches, so let’s also hear from other users. :slight_smile:

@ptrblck thank you for your answer!