Training speed drops a lot after doing augmentation on GPU

Hi! I am doing research on SimCLR-style self-supervised learning, which requires image augmentation for contrastive learning.

  • Initially, I passed the augmentation callable as the transform parameter to the torchvision.datasets.CIFAR10 constructor, which does the augmentation on the CPU. I found that data loading was the bottleneck for training speed.
  • I then noticed that since torchvision 0.8.0, the augmentation functions inherit from nn.Module and can be integrated into the CUDA routine. This gives me around a 50% speedup. I did
from torchvision import transforms as T
transforms = nn.Sequential(
  • I investigated this a bit more, and some weird things happen:
    – If I only do augmentation, it runs in less than 1 minute.
    – If I only train the model (doubling the batch size, as contrastive learning does), it runs in less than 1 minute.
    – But if I do augmentation and then train the model, it immediately slows down to 10 minutes.
    The implementation of the augmentation callable and the snapshots are shown below.
    (sorry for putting everything in one pic, I’m new here. I am not allowed to upload multiple images. lol)

I’ve done some preliminary checks with torch profiling, and everything looks good. I profiled training after augmentation for one batch, and it takes 300 ms to finish.

I’ve also checked that it is not a tqdm estimation issue by training through a full epoch. Any hints on why this might happen? I run this code on a Colab GPU runtime (Tesla T4).

How is the timing done per iteration? I’m curious if there are torch.cuda.synchronize() calls before timing is started and before timing is stopped as otherwise the measurements could be incorrect.
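
CUDA kernels are launched asynchronously, so a host-side timer can stop before the GPU work has actually finished. A minimal sketch of synchronized timing is below; `timed` is a hypothetical helper, and the matrix multiply is only a placeholder for the augmentation or training step.

```python
import time
import torch


def timed(fn, *args, warmup=3, iters=10):
    """Return the mean seconds per call of fn(*args)."""
    use_cuda = torch.cuda.is_available()
    # Warm-up calls so one-time setup costs are not measured.
    for _ in range(warmup):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.rand(512, 512, device=device)

# Placeholder workload; swap in the augmentation or the training step.
per_call = timed(torch.matmul, x, x)
print(f"{per_call * 1e3:.3f} ms per call")
```

Timing augmentation alone, training alone, and both together with the same helper should show whether the slowdown is real or just a measurement artifact.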

Additionally, is/are the size(s) of your input(s) comparable with and without augmentation?