The torchvision.models documentation says, in the video classification section, that the expected input size of the frames is 112x112. Why is there this constraint?
I used an R(2+1)D model as a fixed feature extractor with 180x320 input frames, and it seems to work without any problems. Should I resize the frames to 112x112 or not? Why?
PS: I think it works anyway thanks to the AdaptiveAvgPool at the end of the network, is that right?
The reason the documentation says it expects 112x112 is that the paper which produced these pretrained models, benchmarking 3D ConvNets on Kinetics-400, used random 112x112 crops as data augmentation during training. Sections 4.3 and 4.4 of that paper explain the procedure.
And as you have mentioned, models with an adaptive pooling layer at the end can accept inputs of arbitrary size, as long as the input stays above a minimum: the strided convolutions and pooling layers progressively shrink the feature maps, and an input that is too small would be reduced to size zero before it reaches the adaptive pool.
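The effect of the adaptive pool can be seen in isolation with a toy network (a hypothetical two-layer backbone, not the actual R(2+1)D architecture): however large the remaining feature map is, `AdaptiveAvgPool2d(1)` collapses it to 1x1, so the linear head always receives a fixed-length vector:

```python
import torch
import torch.nn as nn

# Toy backbone: two strided convs followed by an adaptive average pool.
# The pool collapses whatever spatial size remains to 1x1, so the linear
# classifier after it always sees a fixed-length (32-dim) vector.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

for size in [(112, 112), (180, 320)]:
    x = torch.randn(1, 3, *size)
    out = net(x)
    print(size, "->", tuple(out.shape))  # (1, 10) for both input sizes
```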
Another instance is the image classification pretrained models: they also use an adaptive pool, and the documentation expects inputs of at least 224x224.