ResNet152 - Flipping features vs Flipping video then computing features

Wilan · January 11, 2022, 9:34pm

I’m currently using ResNet152 to extract features from a video, where the extracted features are of shape (1842, 2048). I would like to also extract features for the same video, but horizontally flipped. I see 2 ways to do so:

1. Flip the video horizontally, then re-compute features
2. Flip the ResNet152 features for each frame from the already-computed features on the original video

The 2nd option is attractive because it would save compute. How would that be achieved? Can ResNet152 features be “flipped” like this? If so, I’m guessing it can be done with torch.flip?

Thank you for reading! Hoping someone can guide me in the right direction.

ptrblck · January 12, 2022, 12:33am

I’m not sure, but I don’t think flipping the intermediate features would work.
If you only flip the input image, the conv kernels are not flipped along with it, so the intermediate activations (as well as the output) would be different. Think about kernels acting as edge detectors that are now applied to a flipped image. I would probably use the first approach of flipping the inputs.
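
To make the edge-detector argument concrete, here is a minimal pure-Python sketch. It is a toy and not ResNet152: the 3x4 image, the single 1x2 kernel, and the helper names (conv2d_valid, relu, hflip, global_avg_pool) are all made up for illustration. It assumes the stored per-frame features come from a global average pool, which seems likely given the (1842, 2048) shape, so there is no spatial axis left to flip.

```python
def conv2d_valid(image, kernel):
    """Cross-correlate a 2D list with a 2D kernel (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Clamp negative activations to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def hflip(grid):
    """Mirror a 2D list left-to-right."""
    return [row[::-1] for row in grid]


def global_avg_pool(fmap):
    """Collapse a 2D feature map to one scalar, like the pooled feature of one channel."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


# Toy image: dark on the left, bright on the right.
image = [[0, 0, 1, 1] for _ in range(3)]

# Hypothetical edge detector: fires on a dark-to-bright step going left to right.
kernel = [[-1, 1]]

# Pooled feature of the original image vs. the horizontally flipped image.
feat_original = global_avg_pool(relu(conv2d_valid(image, kernel)))
feat_flipped_input = global_avg_pool(relu(conv2d_valid(hflip(image), kernel)))

print("original input:", feat_original)
print("flipped input: ", feat_flipped_input)

# The feature maps only mirror each other if the kernel is mirrored as well.
mirrored_ok = conv2d_valid(hflip(image), hflip(kernel)) == hflip(conv2d_valid(image, kernel))
print("flip(image) with flip(kernel) equals flip(feature map):", mirrored_ok)
```

In this toy case the detector fires on the original image and stays silent on the flipped one, so the two pooled values differ, and no reordering of a single pooled number can turn one into the other. The last check shows that the feature map is only mirrored when the kernel is mirrored too, which a trained network does not do for you.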

Wilan · January 13, 2022, 5:33pm

Thanks so much for your reply @ptrblck! It aligns with what I found when comparing my results from the 2 options (just wanted to be sure).