Defining a dataset with transforms, then splitting it for validation

I’m currently loading some data in the following way. MNIST is a custom dataset class that looks pretty much identical to the one in the official tutorial, so nothing special there. to_dtype is a custom transform that does exactly what you would expect, and is also modeled on the official tutorial.
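For reference, to_dtype might look something like this minimal sketch (hypothetical; per the tutorial, the real class just needs to be a callable):

import torch

class to_dtype:
    # cast whatever ToTensor produced to float32
    def __call__(self, tensor):
        return tensor.to(torch.float32)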

import numpy as np
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# MNIST (the custom Dataset) and to_dtype (the custom transform) are defined elsewhere

transform = transforms.Compose([transforms.ToPILImage(),
                                transforms.RandomRotation(10, fill=(0,)),
                                transforms.RandomHorizontalFlip(),
                                transforms.RandomPerspective(),
                                transforms.RandomAffine(10),
                                transforms.ToTensor(),
                                to_dtype(),
                                transforms.Normalize((0.5,), (0.5,))])

trainset = MNIST('data/train.csv', transform=transform)

N = len(trainset)
n_valid = int(np.floor(N * 0.2))  # hold out 20% of the samples for validation
trainset, validset = torch.utils.data.random_split(trainset, (N - n_valid, n_valid))

trainload = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
validload = DataLoader(validset, batch_size=32, shuffle=True, num_workers=4)

Apparently random_split() does not return two datasets of the same type as the one passed in; it returns torch.utils.data.Subset objects. I therefore cannot access the transform attribute to turn it off in the validation set.
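For reference, this is what random_split actually hands back (names from the snippet above):

print(type(trainset))                  # <class 'torch.utils.data.dataset.Subset'>
print(hasattr(trainset, 'transform'))  # False; the attribute lives on trainset.dataset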

Well, I can with

validset.dataset.transform = None

but that turns off transforms for the training set too.
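That happens because both Subsets returned by random_split wrap one and the same underlying dataset, so changing its transform affects both splits. One way to see it:

print(trainset.dataset is validset.dataset)  # True: a single MNIST instance backs both splits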

Should I be worried about this? Is having augmented validation data bad?

Instead of using random_split, you could create two datasets: one training dataset with the random transformations, and a validation dataset with its corresponding (deterministic) transformations.
Once you have created both datasets, you could randomly split the data indices, e.g. using sklearn.model_selection.train_test_split. These indices can then be passed to torch.utils.data.Subset together with their respective datasets to create the final training and validation datasets.
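A minimal sketch of that approach, assuming the MNIST class and the transform pipeline defined above:

from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# two dataset instances over the same file, each with its own transform
train_ds = MNIST('data/train.csv', transform=transform)
valid_ds = MNIST('data/train.csv', transform=transforms.ToTensor())

# split the sample indices rather than the data itself
indices = list(range(len(train_ds)))
train_idx, valid_idx = train_test_split(indices, test_size=0.2)

# each Subset draws from the same file but applies its own dataset's transform
trainset = Subset(train_ds, train_idx)
validset = Subset(valid_ds, valid_idx)

Since the two Subsets are built from disjoint index sets, no sample appears in both splits, while each split keeps its own transform pipeline.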

Okay, that sounds like a good solution! And since you are suggesting it, I assume the answer to my question is that augmenting the validation set is a bad idea?

Actually, train_test_split works really well with how I defined my dataset class! I can just do the following, which solves my augmentation problem and I think is a little cleaner:

import pandas as pd
from sklearn import model_selection

train_df = pd.read_csv('data/train.csv')
X, y = train_df.iloc[:, 1:], train_df.iloc[:, 0]
x_train, x_valid, y_train, y_valid = model_selection.train_test_split(X, y)

trainset = MNIST(pd.concat([x_train, y_train], axis=1), transform=transform)
# the validation set skips the random augmentations but keeps the deterministic
# preprocessing, so train and validation inputs are scaled the same way
validset = MNIST(pd.concat([x_valid, y_valid], axis=1),
                 transform=transforms.Compose([transforms.ToTensor(),
                                               to_dtype(),
                                               transforms.Normalize((0.5,), (0.5,))]))

trainload = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
validload = DataLoader(validset, batch_size=32, shuffle=True, num_workers=4)

I would say it’s not the usual approach, and I don’t see many reasons to augment the validation set (but there might be valid use cases I’m not aware of).
