Train Test Split with separate transforms?

Arohan_Ajit · September 4, 2020, 6:02am

I’m a beginner in PyTorch, but I’ve built a data pipeline a couple of times. The way I know to split the data is to shuffle the indices and separate them into train, validation, and test sets:

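import numpy as np
import torch
from torch.utils.data import SubsetRandomSampler
from torchvision import transforms
from torchvision.datasets import ImageFolder

data_transforms = transforms.Compose([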
    transforms.Resize((50,50)),
    transforms.RandomRotation(degrees=30),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
    ])

data = ImageFolder('breast-histopathology/',transform=data_transforms)
valid_size = 0.15
test_size = 0.15

# total number of samples across all three splits
num_samples = len(data)
indices = list(range(num_samples))
np.random.shuffle(indices)
# cut points in the shuffled index list: [valid | test | train]
valid_split = int(np.floor(valid_size * num_samples))
test_split = int(np.floor((valid_size + test_size) * num_samples))
valid_idx, test_idx, train_idx = indices[:valid_split], indices[valid_split:test_split], indices[test_split:]

print(len(valid_idx), len(test_idx), len(train_idx))

# define samplers for obtaining training, validation, and test batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)
test_sampler = SubsetRandomSampler(test_idx)


loaders = {
    'train': torch.utils.data.DataLoader(data, batch_size=128, sampler=train_sampler),
    'test': torch.utils.data.DataLoader(data, batch_size=32, sampler=test_sampler),
    'valid': torch.utils.data.DataLoader(data, batch_size=32, sampler=valid_sampler),
}

However, I don’t know how to do this if I want separate transforms. This method applies one transform to the whole dataset. Is there a way to divide the dataset and specify separate transforms for each subset (e.g. augmented data for training and the original data for validation)?
P.S. It can be done by making separate train and test folders using shutil or os, but I was wondering whether there’s a way to do it within PyTorch.

tom · September 4, 2020, 8:54pm

For best practice:

I would advise separating train/val/test earlier, possibly even at the file system level.
I know nothing about your dataset, but have you checked that simply splitting at the image level is the right thing to do? (In medical imaging, you typically want all images of a given patient on the same side of the split, and the like.)
After the above two, it is natural to have separate datasets for train/val/test. At that point, having different transforms is easy.
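
For readers who want to stay within a single folder, one possible way to get per-split transforms is to leave the base dataset untransformed and wrap each index subset in a small dataset that applies its own transform. The sketch below assumes the base dataset returns `(image, label)` pairs; `TransformedSubset` and `make_splits` are hypothetical helper names, not part of PyTorch.

```python
from torch.utils.data import Dataset, Subset


class TransformedSubset(Dataset):
    """Hypothetical helper: applies a transform to samples of an untransformed subset."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        # The wrapped subset is assumed to yield (image, label) pairs.
        image, label = self.subset[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, label


def make_splits(base, train_idx, valid_idx, test_idx, train_transform, eval_transform):
    """Build one dataset per split; only the training split gets train_transform."""
    return {
        'train': TransformedSubset(Subset(base, train_idx), train_transform),
        'valid': TransformedSubset(Subset(base, valid_idx), eval_transform),
        'test': TransformedSubset(Subset(base, test_idx), eval_transform),
    }
```

With the code from the question, `base` would be the `ImageFolder` created without a `transform` argument, `train_transform` would contain the random rotation and flip, and `eval_transform` would keep only the resize, `ToTensor`, and normalization steps. Each resulting dataset can then be passed to its own `DataLoader`, with `shuffle=True` for the training split instead of a sampler.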

Best regards

Thomas
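
The patient-level concern raised above can be sketched as a split over groups rather than over individual images. This is a minimal illustration using only the standard library; `split_by_group` is a hypothetical helper, and `groups` is assumed to be a list holding one group identifier (for example a patient ID) per sample, which would have to be derived from the dataset's file names or metadata.

```python
import random
from collections import defaultdict


def split_by_group(groups, valid_size=0.15, test_size=0.15, seed=0):
    """Return (train_idx, valid_idx, test_idx), keeping each group in one split.

    The fractions apply to the number of groups, not the number of samples.
    """
    # Collect the sample indices that belong to each group.
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    # Shuffle the group identifiers reproducibly.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    n_valid = int(valid_size * len(group_ids))
    n_test = int(test_size * len(group_ids))
    valid_groups = group_ids[:n_valid]
    test_groups = group_ids[n_valid:n_valid + n_test]
    train_groups = group_ids[n_valid + n_test:]

    def collect(selected):
        # Flatten the chosen groups back into a list of sample indices.
        return [i for g in selected for i in by_group[g]]

    return collect(train_groups), collect(valid_groups), collect(test_groups)
```

The returned index lists can be used in place of `train_idx`, `valid_idx`, and `test_idx` from the question. Because whole groups are assigned together, the resulting split sizes in samples will only approximate the requested fractions when groups differ in size.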