I want to pass in a sequence of images and translate it into another sequence of images using an LSTM. Each image is passed through a CNN that reduces the height and width to 1 and the channels to 128; we then squeeze that to a 128-dimensional vector and feed it into the LSTM cell.
More concretely, I can pass b x c x h x w to the convolutional network, which outputs (after squeezing) b x 128 x 1 x 1, i.e. effectively b x 128. That 128 will be the LSTM's input_size. The shape the LSTM expects is (seq_len, batch, input_size), but from the CNN output we only have batch and input_size, not seq_len, because the CNN doesn't take multiple images at a time. Is the solution to write a for loop in the training code that runs the CNN separately on each image and then passes the stacked outputs to the LSTM? I was hoping there would be a faster/better way to do it.
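For reference, one commonly used way to avoid a per-image Python loop is to fold seq_len into the batch dimension before the CNN and unfold it afterwards, so the CNN processes every image of every sequence in one call. Here is a minimal PyTorch sketch; the layer sizes, image size of 32x32, and hidden_size of 256 are illustrative assumptions, not taken from the question:

```python
import torch
import torch.nn as nn

# Hypothetical shapes for illustration: 5-step sequences, batch of 4,
# 3-channel 32x32 images.
seq_len, batch, c, h, w = 5, 4, 3, 32, 32

# A toy CNN mapping (N, 3, 32, 32) -> (N, 128, 1, 1); substitute your
# real convolutional stack here.
cnn = nn.Sequential(
    nn.Conv2d(c, 64, kernel_size=3, stride=2, padding=1),    # 32 -> 16
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 16 -> 8
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                                 # 8 -> 1
)
lstm = nn.LSTM(input_size=128, hidden_size=256)  # hidden_size is arbitrary

x = torch.randn(seq_len, batch, c, h, w)

# Fold the time axis into the batch axis so the CNN sees all
# seq_len * batch images at once.
features = cnn(x.view(seq_len * batch, c, h, w))  # (seq*batch, 128, 1, 1)

# Unfold back to the (seq_len, batch, input_size) layout the LSTM expects.
features = features.view(seq_len, batch, 128)
output, (h_n, c_n) = lstm(features)

print(output.shape)  # torch.Size([5, 4, 256])
```

The fold/unfold is just a view, so it adds no copy overhead, and autograd flows through it unchanged; this is the same computation as the for loop, but done in a single batched CNN forward pass.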