Convolutional + LSTM batch dimensions question


I want to pass in a sequence of images and translate them into another sequence of images using an LSTM. Each image is passed through a CNN that reduces the height and width to 1 and the channel count to 128; we reshape that output into a 1-D vector of length 128 and feed it into the LSTM cell.

More concretely, I can pass b x c x h x w into the convolutional layers/network, which then outputs (after squeezing) b x 128 x 1 x 1, effectively b x 128. Here 128 will be the LSTM's input_size. The shape the LSTM expects is (seq_len, batch, input_size), but from the CNN output we only have batch and input_size, not seq_len, because the CNN doesn't take multiple images at a time. Is the solution to write a for loop in the training code that runs the CNN separately on each image and then passes those outputs to the LSTM model? I was hoping there would be a faster/better way to do it.
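For reference, a common way to avoid the per-image for loop is to fold the time dimension into the CNN's batch dimension and reshape afterwards. A minimal sketch, assuming 64x64 RGB frames and a toy one-layer CNN (the layer sizes, names, and hidden size below are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

# Toy CNN that collapses a 64x64 frame to a 128-channel 1x1 feature map
# (one big kernel for brevity; any CNN with the same output shape works).
cnn = nn.Sequential(nn.Conv2d(3, 128, kernel_size=64))
lstm = nn.LSTM(input_size=128, hidden_size=256)

b, t = 4, 10                                    # 4 sequences of 10 frames each
frames = torch.randn(b, t, 3, 64, 64)

# Fold time into the CNN batch dimension instead of looping over frames.
feats = cnn(frames.view(b * t, 3, 64, 64))      # (b*t, 128, 1, 1)
feats = feats.view(b, t, 128).permute(1, 0, 2)  # (seq_len=t, batch=b, 128)
out, _ = lstm(feats)                            # out: (t, b, 256)
```

This runs the CNN once over all b*t frames in a single batched call, then hands the LSTM a properly shaped (seq_len, batch, input_size) tensor.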


My understanding is that your batch_size_CNN for the CNN is not the same thing as the batch_size_LSTM for the LSTM. I think (and I could be completely wrong) that in this setup your batch_size_LSTM is 1 while seq_len == batch_size_CNN: the CNN processes all the frames of one sequence as its batch, and the LSTM then treats those frames as the time steps of a single sequence.
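That interpretation can be sketched as: feed all frames of one sequence through the CNN as its batch, then unsqueeze a batch dimension of 1 for the LSTM. (The layer shapes and names below are illustrative assumptions, not from the original post.)

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv2d(3, 128, kernel_size=64))  # toy CNN, assumed shapes
lstm = nn.LSTM(input_size=128, hidden_size=256)

seq = torch.randn(20, 3, 64, 64)  # one sequence of 20 frames; CNN batch = 20
feats = cnn(seq).flatten(1)       # (20, 128): batch_size_CNN x input_size
lstm_in = feats.unsqueeze(1)      # (seq_len=20, batch_size_LSTM=1, input_size=128)
out, (h, c) = lstm(lstm_in)       # out: (20, 1, 256)
```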

Hope this helps!