The height and width of the input video/frame for the PyTorchVideo library's pretrained models

I need to use a pretrained model from the PyTorchVideo library, which provides pretrained weights for several architectures. My main problem is that I could not find any information about what the spatial size of the input video/frames (I mean the height and width) should be when running inference with the pretrained models. To be clearer, I need to use the pretrained X3D and R(2+1)D models, but I cannot find how I should select the spatial size of the input video/frames.

The torchhub_inference_tutorial from the same repository might be a good starting point; it uses crop_size = 256.
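For reference, a minimal inference sketch along the lines of that tutorial (assuming the slow_r50 Torch Hub model and the tutorial's values of num_frames = 8 and crop_size = 256; loading the weights requires network access):

```python
import torch

# Load a pretrained model from the PyTorchVideo Torch Hub entry points.
model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
model = model.eval()

# PyTorchVideo models take a batch of clips shaped (B, C, T, H, W).
# With num_frames = 8 and crop_size = 256, a slow_r50 clip would be:
clip = torch.randn(1, 3, 8, 256, 256)

with torch.no_grad():
    preds = model(clip)

print(preds.shape)  # expected: torch.Size([1, 400]), Kinetics-400 logits
```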

Dear @ptrblck, thanks for the answer. One thing is still ambiguous to me: since crop_size = 256 is mentioned, does that mean the input size should be (256, 256)?

A typical approach is resizing the videos to a larger size (as a pre-processing step) and then cropping a part of each video (as a data augmentation step). That's why (I think) it's called crop_size. A sketch of this pipeline follows below.
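As an illustration, here is a sketch of that resize-then-crop pipeline in the style of the PyTorchVideo tutorials. The side_size / crop_size / num_frames values are the tutorial's slow_r50 settings and are only assumptions for other models; CenterCropVideo and NormalizeVideo come from torchvision's private _transforms_video module, as in the tutorial:

```python
import torch
from pytorchvideo.transforms import ShortSideScale, UniformTemporalSubsample
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo,
)

side_size = 256   # resize step: scale the short side of each frame to this
crop_size = 256   # crop step: then take a (crop_size, crop_size) spatial crop
num_frames = 8    # tutorial value for slow_r50; differs per model

transform = Compose([
    UniformTemporalSubsample(num_frames),               # sample T frames
    Lambda(lambda x: x / 255.0),                        # uint8 -> [0, 1] floats
    NormalizeVideo([0.45, 0.45, 0.45], [0.225, 0.225, 0.225]),
    ShortSideScale(size=side_size),                     # resize short side
    CenterCropVideo(crop_size=(crop_size, crop_size)),  # spatial crop
])

# A video tensor shaped (C, T, H, W), e.g. as decoded by pytorchvideo.
video = torch.randint(0, 255, (3, 64, 480, 640), dtype=torch.uint8)
clip = transform(video)
print(clip.shape)  # expected: torch.Size([3, 8, 256, 256])
```

So the tensor that reaches the model is (256, 256) spatially, but the raw input video can be larger, since the resize and crop happen in the transform.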
You have training recipes in pytorchvideo_trainer: https://github.com/facebookresearch/pytorchvideo/tree/main/pytorchvideo_trainer