Sequence length of the tensor from a CNN for training an LSTM

Dear all,
I'm trying to train a CNN (encoder) with an LSTM (decoder) on video sequences. For every 16 video frames, the model should estimate one value (regression). I use VGG16 to extract a feature for each video frame, taken from the conv5_3 layer, so the tensor shape is (64, 512, 14, 14) (batch_size, depth, height, width). I reshape the tensor to (64, 512, 196) and sum over the last dimension to get (64, 512), and this tensor is then used to train the LSTM. However, the LSTM expects input of shape (batch_size, seq_len, input_size), and my sequence length is 16 (many-to-one). So how should I set up the shape of the input tensor for the LSTM? Do I need tensor.view(64, 16, -1)?
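Here is a minimal sketch of the pipeline I described; the frame batch, image size, and the layer index for conv5_3 are what I assume for torchvision's VGG16:

```python
import torch
import torchvision.models as models

vgg16 = models.vgg16(pretrained=True)
conv5_3 = vgg16.features[:30]           # up to and including the ReLU after conv5_3
conv5_3.eval()

frames = torch.randn(64, 3, 224, 224)   # hypothetical batch of 64 video frames

with torch.no_grad():
    feat = conv5_3(frames)              # (64, 512, 14, 14)

feat = feat.view(64, 512, 14 * 14)      # (64, 512, 196)
feat = feat.sum(dim=-1)                 # (64, 512): one 512-D vector per frame
```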

If I understood the problem correctly, you first convert every 16 frames of a video to a 512-D vector, which means that if your video has, say, n frames, you get n/16 512-D vectors. Your sequence length is then n/16. But you said it's 16. Are all your videos 256 frames long?

So if you know your n/16, use tensor.view(batch_size, n/16, 512). I'm not sure whether I answered your question; it would help if you mentioned the paper you're implementing.
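For example, a rough sketch of that reshape followed by a many-to-one LSTM; the hidden size, the regression head, and n = 256 below are only illustrative:

```python
import torch
import torch.nn as nn

batch_size, n, feat_dim = 2, 256, 512        # e.g. videos of n = 256 frames
seq_len = n // 16                             # n/16 vectors per video -> 16 here

chunk_feats = torch.randn(batch_size * seq_len, feat_dim)   # one 512-D vector per 16-frame chunk
x = chunk_feats.view(batch_size, seq_len, feat_dim)          # (batch_size, n/16, 512)

lstm = nn.LSTM(input_size=feat_dim, hidden_size=256, batch_first=True)
head = nn.Linear(256, 1)                      # one regression value per video

out, _ = lstm(x)                              # out: (batch_size, seq_len, 256)
pred = head(out[:, -1, :])                    # many-to-one: last time step -> (batch_size, 1)
```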

Thank you for your answer! I found the answer in another discussion here:
Pytorch timedistributed
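For reference, the TimeDistributed-style pattern from that thread, as I understand it, is to fold the time dimension into the batch dimension before applying the per-frame CNN and then unfold it again for the LSTM. A rough sketch (class and variable names are mine, not from the thread):

```python
import torch
import torch.nn as nn

class VideoRegressor(nn.Module):
    def __init__(self, frame_encoder, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = frame_encoder                  # any module: one frame -> feat_dim vector
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)              # regression head

    def forward(self, clips):                         # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        x = clips.reshape(b * t, *clips.shape[2:])    # merge batch and time
        x = self.encoder(x)                           # (B*T, feat_dim)
        x = x.reshape(b, t, -1)                       # (B, T, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])               # one value per clip

# Usage with a stand-in encoder (the real one would be the VGG16 feature extractor):
dummy_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 512))
model = VideoRegressor(dummy_encoder)
clips = torch.randn(2, 16, 3, 224, 224)               # 2 clips of 16 frames
print(model(clips).shape)                              # torch.Size([2, 1])
```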