This may seem like a silly question, but here it is:
I need to develop a generic pipeline to classify pretty long videos (typically 1 min at 25 RGB fps, i.e. a tensor of shape (3, 25*60, width, height)).
I thought a good way to do this at a reasonable computational cost would be to divide a video into a sequence of smaller segments (10 s each, for instance). A network would then produce a prediction for each segment, and the predictions would be aggregated at the end.
One should be able to define the depth (that is, the number of frames) of one element of this sequence, as well as the 3D CNN that will be used to classify each element.
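To make the setup concrete, here is a minimal sketch of the segmentation step I have in mind (the 64×64 spatial size and the 10 s segment depth are just hypothetical values):

```python
import torch

# Hypothetical example: a 60 s clip at 25 fps -> 1500 frames
video = torch.randn(3, 25 * 60, 64, 64)  # (channels, frames, H, W)

# Configurable segment depth: 10 s at 25 fps = 250 frames
segment_depth = 250
segments = torch.split(video, segment_depth, dim=1)

print(len(segments))        # 6
print(segments[0].shape)    # torch.Size([3, 250, 64, 64])
```

Each element of `segments` would then be fed to the (also configurable) 3D CNN.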
Now my question is: what if one chooses a depth of 1?
That would feed the 3D CNN a cube of depth 1. Is that an issue? Is it equivalent to giving a 2D image to a 2D CNN?
Of course I could detect when the chosen depth is 1 and instantiate a 2D CNN instead of a 3D CNN in that case, but is that really necessary?
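Here is the comparison I am wondering about, sketched with a single Conv3d layer whose temporal kernel size is 1 (the layer sizes are hypothetical; I copy the 2D weights into the 3D layer just to compare the outputs):

```python
import torch
import torch.nn as nn

x2d = torch.randn(1, 3, 32, 32)  # (N, C, H, W)
x3d = x2d.unsqueeze(2)           # (N, C, D=1, H, W), the depth-1 "cube"

conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 8, kernel_size=(1, 3, 3), padding=(0, 1, 1))

# Reuse the 2D weights in the 3D layer so the two are directly comparable
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(2))  # (8,3,3,3) -> (8,3,1,3,3)
    conv3d.bias.copy_(conv2d.bias)

y2d = conv2d(x2d)
y3d = conv3d(x3d).squeeze(2)  # drop the depth-1 dimension again

print(torch.allclose(y2d, y3d, atol=1e-5))  # True
```

Note that this only runs because the kernel's temporal size is 1; a temporal kernel larger than the input depth would fail outright, which is part of what makes me unsure whether depth 1 is safe with an arbitrary 3D CNN.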
I hope I’m being clear, and thank you in advance!