Hi everyone!
I am trying to fine-tune SlowFast models on my custom dataset for video action recognition using pytorchvideo and torchvision. I am a bit unsure whether I have combined transforms from the two libraries correctly.
Here is what my transformation + augmentation scheme looks like:
```python
self.train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            # pytorchvideo transforms operate on (C, T, H, W) clips
            UniformTemporalSubsample(32),
            # (C, T, H, W) -> (T, C, H, W) so the torchvision image
            # transforms below treat T as a batch dimension
            Permute([1, 0, 2, 3]),
            RandomAffine(degrees=20, translate=(0, 0.1), shear=(-15, 15, -15, 15)),
            GaussianBlur(kernel_size=3, sigma=(0.1, 1.5)),
            # back to (C, T, H, W) for the pytorchvideo transforms
            Permute([1, 0, 2, 3]),
            Lambda(lambda x: x / 255.0),
            # NOTE: this should be pytorchvideo.transforms.Normalize, which
            # understands (C, T, H, W); torchvision's Normalize would treat
            # the T dimension as channels here
            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
            RandomShortSideScale(min_size=256, max_size=320),
            RandomCrop(256),
            RandomHorizontalFlip(p=0.5),
            # split the clip into the slow and fast pathway inputs
            PackPathway(),
        ]
    ),
)
```
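For reference, my understanding of the layout convention is that pytorchvideo transforms expect `(C, T, H, W)` while torchvision's image transforms expect `(..., C, H, W)`, so the two `Permute` calls are meant to bridge between them. A minimal shape check of that round-trip (plain torch, no pytorchvideo needed):

```python
import torch

# Fake clip in pytorchvideo layout: (C, T, H, W)
clip = torch.rand(3, 32, 240, 320)

# Permute([1, 0, 2, 3]) -> (T, C, H, W): the torchvision transforms
# (RandomAffine, GaussianBlur) then see T as a batch dimension
tv_layout = clip.permute(1, 0, 2, 3)
assert tv_layout.shape == (32, 3, 240, 320)

# The same permutation applied again restores (C, T, H, W) for the
# remaining pytorchvideo transforms
back = tv_layout.permute(1, 0, 2, 3)
assert torch.equal(back, clip)
```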
The reason I am suspicious is that when I visualized the videos after applying the above transforms, they looked weird. Here are two different videos I visualized after the pipeline: https://imgur.com/a/zcL1zeF
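One thing I wondered about: after `Normalize` the pixel values are roughly zero-centered, so plotting the frames directly would not look like natural images anyway. Here is a sketch of the de-normalization I would apply before visualizing (mean/std values copied from my pipeline above; `denormalize` is just a helper name I made up):

```python
import torch

MEAN = torch.tensor([0.45, 0.45, 0.45])
STD = torch.tensor([0.225, 0.225, 0.225])

def denormalize(clip: torch.Tensor) -> torch.Tensor:
    """Invert Normalize(MEAN, STD) on a (C, T, H, W) clip so the
    frames can be displayed; output is clamped to [0, 1]."""
    out = clip * STD[:, None, None, None] + MEAN[:, None, None, None]
    return out.clamp(0.0, 1.0)

# Round-trip check on a fake clip already scaled to [0, 1]
raw = torch.rand(3, 8, 64, 64)
normed = (raw - MEAN[:, None, None, None]) / STD[:, None, None, None]
restored = denormalize(normed)
assert torch.allclose(restored, raw, atol=1e-5)
```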
I would be really grateful if someone could check this, let me know if anything is wrong, and point me toward how to correct it.