N-to-1 frame CNN prediction model: how to represent the data?

Hi all!

I’m relatively new to PyTorch and I have a CNN that can predict the next frame reasonably well from 1 input frame. Now I want this model to output 1 image based on a sequence of images (e.g. give it 2 frames and make it guess the next frame). But I’m not sure how to represent the data in a way that makes sense in PyTorch. Is there a known way to do this?

I’ve considered concatenating the input images along a spatial dimension (height or width), but that doesn’t make intuitive sense, and I’d need extra shenanigans to make the output dimensions work (currently the input and output are both single images, so they have the same shape, but that won’t hold if the input is multiple images stacked together).

I’ve also thought about concatenating the input images along the channel dimension, so that, for example, 3 RGB images would become one tensor with 9 channels, while the output (1 image) would still have 3. This, again, doesn’t make intuitive sense to me, and I’m also not sure how to do it in PyTorch.
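To make it concrete, something like this is roughly what I have in mind (I’m not sure if this is the idiomatic way to do it; the layer and image sizes are just placeholders):

```python
import torch
import torch.nn as nn

# three RGB frames, e.g. a batch of 8, each 3x64x64 (shapes are just an example)
frames = [torch.randn(8, 3, 64, 64) for _ in range(3)]

# stack them along the channel dimension -> (8, 9, 64, 64)
x = torch.cat(frames, dim=1)

# first conv takes 9 input channels; the last layer still outputs 3 channels,
# so the prediction is a single RGB frame again
net = nn.Sequential(
    nn.Conv2d(9, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)

out = net(x)  # (8, 3, 64, 64)
```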

Thank you in advance for any helpful input!

There are multiple possible approaches you could try:

  • you could represent the sequence of images as a new temporal dimension, run the feature extractor (the conv-pool layers) in a loop over the time steps, and pass the outputs to an RNN. Afterwards you could take the activation of the last time step and feed it to a classifier (linear layers); see the first sketch after this list.
  • instead of looping in the feature extractor, you could use the nn.*3d modules and use the depth dimension as the temporal dimension (second sketch below).
  • stacking the frames in the channel dimension and using grouped convs could also work, but I think this would be the least intuitive of the three approaches.
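
A rough sketch of the first approach (all layer sizes are placeholders; since you want to predict a frame rather than a class, I’m reshaping the linear output back to an image instead of using a classifier):

```python
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    def __init__(self, img_size=64, hidden=256):
        super().__init__()
        self.img_size = img_size
        # shared feature extractor, applied to every frame in the sequence
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        feat_dim = 32 * (img_size // 4) * (img_size // 4)
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        # map the last hidden state back to a flattened RGB frame
        self.head = nn.Linear(hidden, 3 * img_size * img_size)

    def forward(self, x):
        # x: (batch, time, 3, H, W)
        b, t = x.shape[:2]
        feats = []
        for i in range(t):                  # loop over the temporal dimension
            feats.append(self.encoder(x[:, i]))
        feats = torch.stack(feats, dim=1)   # (batch, time, feat_dim)
        _, h = self.rnn(feats)              # h: (1, batch, hidden)
        out = self.head(h[-1])              # use the last time step
        return out.view(b, 3, self.img_size, self.img_size)

model = FramePredictor()
seq = torch.randn(4, 2, 3, 64, 64)   # 4 sequences of 2 RGB frames
pred = model(seq)                    # (4, 3, 64, 64)
```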

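And a minimal sketch of the second approach with nn.Conv3d (again, the kernel sizes are just placeholders; here the temporal kernel size of 2 collapses the two input frames in a single conv):

```python
import torch
import torch.nn as nn

# treat the sequence as a depth dimension: (batch, channels, time, H, W)
x = torch.randn(4, 3, 2, 64, 64)     # 4 sequences of 2 RGB frames

net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(2, 3, 3), padding=(0, 1, 1)),  # collapses the 2 time steps
    nn.ReLU(),
)
out = net(x)                                 # (4, 16, 1, 64, 64)
out = out.squeeze(2)                         # drop the (now singleton) temporal dimension
out = nn.Conv2d(16, 3, 3, padding=1)(out)    # back to a single RGB frame: (4, 3, 64, 64)
```
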
Maybe others have more and better ideas. :slight_smile: