Best Way of Combining Videos with Different Frame Sizes

Hi, I am new to machine learning and computer vision, so I want to make sure I am doing this as optimally as possible.

I have a dataset whose `__getitem__()` returns a tuple of a video tensor and a label, each corresponding to one video file. The video tensor actually has shape (clips, channels=3, frames, height, width) because I support running predictions on multiple clips from the original video as augmentations. However, when I use a DataLoader, some of the video files (samples) have different frame sizes, so the default collate function would try to stack, for example, a (10, 3, 16, 384, 384) tensor with a (10, 3, 16, 440, 440) tensor along a new axis, which clearly raises an error.
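For concreteness, here is a minimal snippet that reproduces the issue (assuming a recent PyTorch where `default_collate` is exposed under `torch.utils.data`; the shapes are the ones from my example above):

```python
import torch
from torch.utils.data import default_collate

# Two samples with the shapes above: (clips, channels, frames, H, W)
video_a = torch.randn(10, 3, 16, 384, 384)
video_b = torch.randn(10, 3, 16, 440, 440)

try:
    default_collate([(video_a, 0), (video_b, 1)])
except RuntimeError as e:
    print(e)  # "stack expects each tensor to be equal size, ..."
```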

I can see that I have the option either to resize each video or to pad the smaller ones to match the larger ones. Which transform would be better for training and testing? Resizing would likely keep the mean and standard deviation roughly the same, but it may drastically alter the convolution computations (this is a CNN) because pixel locations shift. Padding, on the other hand, would drastically alter the mean and standard deviation by introducing a bunch of zeros, but in theory the convolution computations wouldn't change as much. If I were to pad, where would be the best place to do it? Uniformly around the frame? At a corner?
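To illustrate what I mean by the two padding placements, here is a rough sketch (`pad_video_to` is just a made-up helper, not something I am already using):

```python
import torch
import torch.nn.functional as F

def pad_video_to(video, target_h, target_w, mode="center"):
    """Zero-pad a (clips, C, T, H, W) tensor up to (target_h, target_w).

    mode="center" distributes the padding evenly around the frame,
    mode="corner" pushes the content to the top-left corner.
    """
    h, w = video.shape[-2], video.shape[-1]
    pad_h, pad_w = target_h - h, target_w - w
    if mode == "center":
        # F.pad takes (left, right, top, bottom) for the last two dims
        padding = (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2)
    else:  # "corner": all padding on the right/bottom
        padding = (0, pad_w, 0, pad_h)
    return F.pad(video, padding, mode="constant", value=0)

small = torch.randn(10, 3, 16, 384, 384)
print(pad_video_to(small, 440, 440).shape)  # torch.Size([10, 3, 16, 440, 440])
```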

Furthermore, at what point in the pipeline would it make the most sense to apply such transforms? When converting the video files to tensors? After the conversion, but inside `__getitem__()`? Or in a `collate_fn` for the DataLoader?

Apologies for all the questions, I just want to make sure to pick up the right practices as it seems that very subtle differences could mean a lot in determining performance.

Bumping this. Let me know if there is anything unclear.

Hi 🙂
You are right that it is indeed possible either to resize the frames or to pad the smaller ones.
However, a common alternative is to crop the frames, e.g. at a random location using RandomCrop or at the center using CenterCrop.

Using these crops solves the size-mismatch problem you mentioned.
The disadvantage is that some information is lost.
Nonetheless, if the ratio between the smallest and largest frame dimensions isn't too large, CenterCrop generally does not lose much information and produces accurate predictions most of the time, so it is a good choice for the test samples.
Moreover, applying RandomCrop to the training frames improves the model's generalization to translations of objects within the frame.
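For example, a quick sketch with torchvision (assuming torchvision ≥ 0.8, where both transforms accept tensors with arbitrary leading dimensions, so the same crop is applied to every frame of every clip):

```python
import torch
from torchvision import transforms

# Training: random crop location improves robustness to translation.
train_transform = transforms.RandomCrop(384)
# Testing: deterministic center crop.
test_transform = transforms.CenterCrop(384)

video = torch.randn(10, 3, 16, 440, 440)  # (clips, C, T, H, W)
print(train_transform(video).shape)  # torch.Size([10, 3, 16, 384, 384])
print(test_transform(video).shape)   # torch.Size([10, 3, 16, 384, 384])
```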

As for when to apply these transformations: I would recommend writing a custom transform and passing it to your Dataset, as demonstrated here: Writing Custom Datasets, DataLoaders and Transforms — PyTorch Tutorials 2.1.1+cu121 documentation.
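A minimal sketch of what that could look like (`VideoDataset`, `load_video_clips`, and `train_samples` are placeholder names for your own dataset class, decoding code, and sample list):

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class VideoDataset(Dataset):
    """Hypothetical dataset returning a (clips, C, T, H, W) tensor and a label."""

    def __init__(self, samples, transform=None):
        self.samples = samples      # e.g. list of (path, label) pairs
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        video = load_video_clips(path)   # placeholder for your decoding code
        if self.transform is not None:
            video = self.transform(video)  # all samples now share one frame size
        return video, label

train_set = VideoDataset(train_samples, transform=transforms.RandomCrop(384))
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
```

With the crop applied inside `__getitem__()`, every sample reaching the DataLoader already has the same spatial size, so the default collate works without a custom `collate_fn`.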

Good luck!