haha. I think @yolle103 is referring to something like training on “videos”. Each video has a different duration, hence a different number of frames or “images”, and the model predicts one output per video. @yolle103, you could look at approaches in lip-reading research or similar “video”-to-target problems.
As a reasonable approach to start off with: resize your “images” to a fixed shape, then sort your “videos” from shortest to longest. After this, pad the end of each “video” in a batch with zero frames up to the longest video in that batch. Thanks to the sorting, neighbouring videos have similar lengths, so the padding does not waste too much GPU compute. The maximum video length will, however, differ from batch to batch. To handle this, pass the frames through an initial CNN feature extractor and then through a sequence model such as an RNN (GRU or LSTM). Now you have a nice, fixed-size output for each “video” to train on!
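A minimal sketch of the sort-then-pad batching idea in NumPy. The function names (`make_batches`, `pad_batch`) are my own, not any library's API:

```python
import numpy as np

def make_batches(videos, batch_size):
    """Sort videos by frame count, then group neighbours into batches."""
    videos = sorted(videos, key=lambda v: v.shape[0])  # shortest first
    return [videos[i:i + batch_size] for i in range(0, len(videos), batch_size)]

def pad_batch(batch):
    """Zero-pad every video in the batch to the longest one."""
    max_len = max(v.shape[0] for v in batch)
    frame_shape = batch[0].shape[1:]                   # e.g. (3, 80, 80)
    out = np.zeros((len(batch), max_len) + frame_shape, dtype=batch[0].dtype)
    for i, v in enumerate(batch):
        out[i, :v.shape[0]] = v                        # real frames first, zeros after
    return out

# Toy "videos": varying frame counts, tiny frames to keep it fast.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((t, 3, 8, 8)) for t in (29, 24, 28, 26)]
batch = pad_batch(make_batches(videos, batch_size=4)[0])
print(batch.shape)  # (4, 29, 3, 8, 8)
```

If you feed this into a PyTorch RNN, it is worth also keeping the true lengths around so you can use `torch.nn.utils.rnn.pack_padded_sequence` and have the RNN skip the zero padding entirely.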
[Of course, there are drawbacks to this “sorting” trick. For example, remember to set shuffling of your data to False everywhere, otherwise the length-based batching falls apart and you get massive compute overheads. One way to overcome this as a drawback is to call it “curriculum learning” and pass it off as helping your model converge faster ]
As an example: the videos have been sorted from shortest to longest, each frame has been reshaped to size 3x80x80 (3 channels), and the batch size is 4. Suppose video 1, video 2, video 3 and video 4 land in one batch, with sizes 24x3x80x80, 26x3x80x80, 28x3x80x80 and 29x3x80x80. Append zero frames of size 3x80x80 to videos 1, 2 and 3, and you get a batch of final size 4x29x3x80x80 to train on. Then use CNNs followed by RNNs, because this number “29” will vary across your batches.
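The worked example above, written out with NumPy (shapes only, dummy data; the names `lengths`, `padded` are mine):

```python
import numpy as np

lengths = [24, 26, 28, 29]                      # frames per video in this batch
videos = [np.ones((t, 3, 80, 80), dtype=np.float32) for t in lengths]

max_len = max(lengths)                          # 29 for this particular batch
padded = np.zeros((len(videos), max_len, 3, 80, 80), dtype=np.float32)
for i, v in enumerate(videos):
    padded[i, :v.shape[0]] = v                  # copy real frames, leave zero padding

print(padded.shape)  # (4, 29, 3, 80, 80)
```

Video 4 already has 29 frames, so it gets no padding; videos 1, 2 and 3 get 5, 3 and 1 zero frames respectively.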