Hi! I am doing research on SimCLR-style self-supervised learning, which requires image augmentation for contrastive learning.
- Initially, I passed the augmentation callable to the `torchvision.datasets.CIFAR10` constructor, which runs the augmentation on the CPU. I found that data loading was the bottleneck for training speed.
- I then noticed that since torchvision 0.8.0, augmentation functions that inherit from `nn.Module` can be composed and run as part of the CUDA routine. This gives me around a 50% speed-up. I did:

      import torch.nn as nn
      from torchvision import transforms as T

      transforms = nn.Sequential(
          T.RandomCrop(224),
          ...
      )
- I kept investigating this a bit more, and some weird things happen:
– If I only do the augmentation, an epoch runs in less than 1 minute.
– If I only train the model (with the batch size doubled, as in contrastive training), it runs in less than 1 minute.
– But if I do the augmentation and then train the model, the run time immediately jumps to 10 minutes.
The implementation of the augmentation callable and the snapshots are shown below.
(sorry for putting everything in one pic, I’m new here. I am not allowed to upload multiple images. lol)
I’ve done some preliminary checks with the torch profiler, and everything looks fine. I profiled training after augmentation for one batch, and it takes 300 ms to finish.
I’ve also checked that it is not a tqdm estimation issue by training for a full epoch. Any hints on why this might happen? I run this code on the Colab GPU runtime (Tesla T4).