How to pass video frames to a CNN-LSTM for classification/regression

I have a video tensor of shape torch.Size([1, 129, 3, 224, 224]), where 1 is the batch size, 129 is the number of frames, 3 is the number of channels, and 224 × 224 are the height and width.
I want to do a classification/regression task. Here I can use a 3D CNN, or a 2D CNN followed by an LSTM. (Let me know if there are any other options.)
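
For reference, a minimal sketch of the 3D-CNN option using torchvision's r3d_18 (the specific model, weights=None, and the dummy input are my assumptions; a recent torchvision is assumed for the weights argument):

import torch
from torchvision.models.video import r3d_18

video = torch.randn(1, 129, 3, 224, 224)  # (batch, frames, channels, H, W)
model3d = r3d_18(weights=None)            # expects (batch, channels, frames, H, W)
clip = video.permute(0, 2, 1, 3, 4)       # -> torch.Size([1, 3, 129, 224, 224])
with torch.no_grad():
    out = model3d(clip)                   # -> torch.Size([1, 400]), Kinetics-400 head

The 400-way head would be swapped for a head matching the actual task, and pushing 129 full-resolution frames through a 3D CNN is memory-heavy, so frames are often subsampled first.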
I am starting with the 2D CNN + LSTM approach.
The first step is to remove the batch dimension with i = i[0], which makes the frame dimension the leading (batch-like) dimension.
After that I split the frames into chunks of 32: i = i[len(i) % 32:].reshape(len(i) // 32, -1, 3, 224, 224)
The goal is to take the first 32 frames as the first batch, the next 32 frames as the next batch, and so on; the slice first drops the leftover 129 % 32 = 1 frame from the start. Let me know if this reshaping is correct. The resulting shape is [4, 32, 3, 224, 224] (see the sanity check below).
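
A quick sanity check of the slicing and reshaping on a dummy tensor (dummy data only; the shapes match the ones above):

import torch

video = torch.randn(1, 129, 3, 224, 224)
i = video[0]                              # -> torch.Size([129, 3, 224, 224])
i = i[len(i) % 32:].reshape(len(i) // 32, -1, 3, 224, 224)
print(i.shape)                            # torch.Size([4, 32, 3, 224, 224])

Since reshape keeps the memory order, chunk 0 holds frames 1–32, chunk 1 holds frames 33–64, and so on (frame 0 is the one dropped by the slice).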
Then I pass each chunk through the model (a ResNet):

x1 = []  # per-chunk ResNet features
for x in i:                   # x: one chunk of 32 frames, shape (32, 3, 224, 224)
    x1.append(self.model(x))  # ResNet output per chunk: (32, 1000)
x = torch.stack(x1)           # stacked features: (4, 32, 1000)
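
For a self-contained version of this step, assuming torchvision's resnet18 as a stand-in for self.model (my assumption; any 2D CNN with a 1000-dim output gives the same shapes), the loop can also be replaced by one batched forward pass:

import torch
from torchvision.models import resnet18

model = resnet18(weights=None)            # stand-in for self.model, 1000-dim head
i = torch.randn(4, 32, 3, 224, 224)       # the chunked frames from above (dummy here)
with torch.no_grad():
    x = model(i.reshape(-1, 3, 224, 224)) # all 128 frames in one pass -> (128, 1000)
x = x.reshape(4, 32, -1)                  # back to (chunks, frames, features)
print(x.shape)                            # torch.Size([4, 32, 1000])

Folding the chunk and frame dimensions together gives the same [4, 32, 1000] tensor as the loop, at the cost of holding all 128 frames in one batch.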

Now I have a tensor of shape [4, 32, 1000] (the 1000 comes from the ResNet's final classification layer). I want to pass this to an LSTM layer, but I am not sure how.
So I create a layer:

lstm = nn.LSTM(input_size=1000, hidden_size=512, batch_first=True)
o, (h, c) = lstm(x)  # o: outputs at every timestep; h, c: final hidden/cell states
o.shape, h.shape, c.shape
(torch.Size([4, 32, 512]), torch.Size([1, 4, 512]), torch.Size([1, 4, 512]))

Now I can treat the first dimension as the batch dimension and, keeping this in mind, replicate my single ground-truth label into an array of four and do the classification/regression.
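
As a concrete sketch of that last step, where num_classes, the linear head, and the use of h[-1] as the per-chunk summary are all my assumptions:

import torch
import torch.nn as nn

h = torch.randn(1, 4, 512)       # stand-in for the LSTM's final hidden state above
num_classes = 10                 # hypothetical; set for your task
head = nn.Linear(512, num_classes)

logits = head(h[-1])             # h[-1]: (4, 512) -> logits: (4, num_classes)
labels = torch.tensor(3).repeat(4)  # one ground-truth label replicated per chunk
loss = nn.functional.cross_entropy(logits, labels)

For regression, the head would output a single value per chunk and the loss would be e.g. nn.functional.mse_loss against the replicated target.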

But I am not sure whether the logic or syntax above is correct.