I am trying to fine-tune SlowFast models on my custom dataset for video action recognition using pytorchvideo and torchvision. I am a bit unsure about how I have combined transforms from the two libraries.
Here is what my transformation + augmentation scheme looks like:
```python
self.train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            # pytorchvideo transforms expect clips in (C, T, H, W) layout
            UniformTemporalSubsample(32),
            # (C, T, H, W) -> (T, C, H, W), so the torchvision image
            # transforms below see the frames as a batch of images
            Permute([1, 0, 2, 3]),
            RandomAffine(degrees=20, translate=(0, 0.1), shear=(-15, 15, -15, 15)),
            GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),
            # back to (C, T, H, W) for the remaining video transforms
            Permute([1, 0, 2, 3]),
            Lambda(lambda x: x / 255.0),
            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
            RandomShortSideScale(min_size=256, max_size=320),
            RandomCrop(256),
            RandomHorizontalFlip(p=0.5),
            PackPathway(),
        ]
    ),
)
```
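For completeness, `PackPathway` is the usual helper from the SlowFast tutorials that splits a clip into the `[slow, fast]` pathway list. Below is a minimal sketch of the version I am using (assuming the tutorial's default `alpha=4`; my actual class may differ slightly), plus a quick shape check:

```python
import torch

class PackPathway(torch.nn.Module):
    """Split a (C, T, H, W) clip into the SlowFast [slow, fast] pathway list."""
    def __init__(self, alpha: int = 4):
        super().__init__()
        self.alpha = alpha

    def forward(self, frames: torch.Tensor):
        fast_pathway = frames
        # Temporally subsample every alpha-th frame for the slow pathway
        slow_pathway = torch.index_select(
            frames,
            1,
            torch.linspace(
                0, frames.shape[1] - 1, frames.shape[1] // self.alpha
            ).long(),
        )
        return [slow_pathway, fast_pathway]

# Quick sanity check on a dummy clip shaped like the output of the
# transforms above: (C=3, T=32, H=256, W=256)
clip = torch.randn(3, 32, 256, 256)
slow, fast = PackPathway()(clip)
print(slow.shape, fast.shape)
# torch.Size([3, 8, 256, 256]) torch.Size([3, 32, 256, 256])
```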
The reason I am suspicious is that when I visualized the videos after applying the above transforms, they looked weird. Here are 2 different videos I visualized this way: https://imgur.com/a/zcL1zeF
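In case my visualization step is itself part of the problem: since the pipeline normalizes the frames, I undo the normalization before converting them back to displayable uint8 images, roughly like this (`to_displayable` is just an illustrative helper name I made up; the mean/std are the values from the pipeline above):

```python
import torch

MEAN = torch.tensor([0.45, 0.45, 0.45]).view(3, 1, 1, 1)
STD = torch.tensor([0.225, 0.225, 0.225]).view(3, 1, 1, 1)

def to_displayable(video: torch.Tensor) -> torch.Tensor:
    """Undo Normalize + /255 scaling on a (C, T, H, W) clip for display."""
    frames = video * STD + MEAN              # invert Normalize
    frames = frames.clamp(0.0, 1.0) * 255.0  # back to the [0, 255] range
    # (C, T, H, W) -> (T, H, W, C) uint8 frames for matplotlib / imageio
    return frames.to(torch.uint8).permute(1, 2, 3, 0)

# e.g. on the fast pathway returned by PackPathway
fast = torch.randn(3, 32, 256, 256)  # stand-in for a transformed clip
frames = to_displayable(fast)
print(frames.shape)  # torch.Size([32, 256, 256, 3])
```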
I would be really grateful if someone could check this, let me know if it is wrong, and point me to how to correct it.