Data augmentation

I am using the following code to do data augmentation of MNIST:

train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,)),
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)

I have a question about the line transforms.Normalize((0.1307,), (0.3081,)): 0.1307 and 0.3081 are the mean and standard deviation of the original MNIST dataset, but they will have changed after augmentation. So should I use the new mean and standard deviation for normalization? Another question: should I apply the same augmentation to the test set? If not, the training set (with augmentation) and the test set would come from different distributions, right?

The idea of augmenting data is to generate "similar" samples from the data-generating distribution, typically to avoid overfitting. The mean and std will change, but I believe the difference won't be significant. You can test this hypothesis yourself by computing the mean and std after the transforms. Also, you should apply the same transformation to the test set; your reasoning is correct in this regard.
