I am trying to fine-tune SlowFast models on my custom dataset for video action recognition using pytorchvideo and torchvision. I am a bit unsure about how I have combined transforms from the two libraries.
Here is what my transformation + augmentation scheme looks like:
```python
self.train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            # pytorchvideo transforms expect clips in (C, T, H, W) layout
            UniformTemporalSubsample(32),
            # (C, T, H, W) -> (T, C, H, W), so the torchvision image
            # transforms below see the frames as a batch of images
            Permute([1, 0, 2, 3]),
            RandomAffine(degrees=20, translate=(0, 0.1), shear=(-15, 15, -15, 15)),
            GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),
            # back to (C, T, H, W) for the remaining video transforms
            Permute([1, 0, 2, 3]),
            Lambda(lambda x: x / 255.0),
            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
            RandomShortSideScale(min_size=256, max_size=320),
            RandomCrop(256),
            RandomHorizontalFlip(p=0.5),
            PackPathway(),
        ]
    ),
)
```
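For completeness, `PackPathway` is the usual helper from the SlowFast tutorials that splits a clip into the `[slow, fast]` pathway list. Below is a minimal sketch of the version I am using (assuming the tutorial's default `alpha=4`; my actual class may differ slightly), plus a quick shape check:

```python
import torch

class PackPathway(torch.nn.Module):
    """Split a (C, T, H, W) clip into the SlowFast [slow, fast] pathway list."""
    def __init__(self, alpha: int = 4):
        super().__init__()
        self.alpha = alpha

    def forward(self, frames: torch.Tensor):
        fast_pathway = frames
        # Temporally subsample every alpha-th frame for the slow pathway
        slow_pathway = torch.index_select(
            frames,
            1,
            torch.linspace(
                0, frames.shape[1] - 1, frames.shape[1] // self.alpha
            ).long(),
        )
        return [slow_pathway, fast_pathway]

# Quick sanity check on a dummy clip shaped like the output of the
# transforms above: (C=3, T=32, H=256, W=256)
clip = torch.randn(3, 32, 256, 256)
slow, fast = PackPathway()(clip)
print(slow.shape, fast.shape)
# torch.Size([3, 8, 256, 256]) torch.Size([3, 32, 256, 256])
```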
The reason I am suspicious is that when I visualized the videos after applying the above transforms, they looked weird. Here are 2 different videos I visualized this way: https://imgur.com/a/zcL1zeF
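In case my visualization step is itself part of the problem: since the pipeline normalizes the frames, I undo the normalization before converting them back to displayable uint8 images, roughly like this (`to_displayable` is just an illustrative helper name I made up; the mean/std are the values from the pipeline above):

```python
import torch

MEAN = torch.tensor([0.45, 0.45, 0.45]).view(3, 1, 1, 1)
STD = torch.tensor([0.225, 0.225, 0.225]).view(3, 1, 1, 1)

def to_displayable(video: torch.Tensor) -> torch.Tensor:
    """Undo Normalize + /255 scaling on a (C, T, H, W) clip for display."""
    frames = video * STD + MEAN              # invert Normalize
    frames = frames.clamp(0.0, 1.0) * 255.0  # back to the [0, 255] range
    # (C, T, H, W) -> (T, H, W, C) uint8 frames for matplotlib / imageio
    return frames.to(torch.uint8).permute(1, 2, 3, 0)

# e.g. on the fast pathway returned by PackPathway
fast = torch.randn(3, 32, 256, 256)  # stand-in for a transformed clip
frames = to_displayable(fast)
print(frames.shape)  # torch.Size([32, 256, 256, 3])
```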
I would be really grateful if someone could check this, let me know if it is wrong, and point me to how to correct it.