What is the correct shape of the input to an LSTM when concatenating features of two different models?

Hello,

I have two different, already trained models for video classification: one model takes keypoints as input, and the other takes video frames as input.

I want to concatenate their features and then pass the concatenated vector to an LSTM.

I am a little confused about whether my concatenation is correct and what the input to the LSTM should look like. At the moment I have the following:

Model A has output [128, 256] = [batch size, num of features]
Model B has output [128, 122880] = [batch size, num of features]
After concatenation I have: [128, 123136]
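
For reference, this is roughly how I build the concatenated vector (feat_a and feat_b are just placeholder names for the two model outputs):

import torch

feat_a = torch.randn(128, 256)     # stands in for the output of model A
feat_b = torch.randn(128, 122880)  # stands in for the output of model B
x = torch.cat((feat_a, feat_b), dim=1)  # [128, 123136]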

Since the LSTM (with batch_first=True) expects input of shape [batch_size, seq_len, input_size], I used:

x = x.unsqueeze(1)  # [128, 1, 123136]

But then the LSTM does not make much sense, since I have no time dimension, right? Is this approach correct?
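
For completeness, the full call I currently have looks roughly like this (hidden_size=512 is just an arbitrary placeholder value):

import torch
import torch.nn as nn

x = torch.randn(128, 1, 123136)  # stands in for the unsqueezed concatenated features
lstm = nn.LSTM(input_size=123136, hidden_size=512, batch_first=True)
out, (h_n, c_n) = lstm(x)        # out: [128, 1, 512], i.e. a "sequence" of length 1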

Thanks!

Indeed, none of the outputs contain a temporal dimension, so it seems both models reduced it in their forward method, assuming the original inputs to them still contained the temporal information. Could you thus explain the architecture of modelA and modelB and check if keeping the temporal dimension would make sense?

Model A is a simple 1D CNN that takes keypoints as input. Model B is a pretrained ResNet that takes videos as input.

For the 1D CNN with keypoints as input data, I printed how the shape changes during forward:

torch.Size([512, 51, 45]) # input data [batch size, num of features, num of frames = time dimension]
torch.Size([512, 51, 45])
torch.Size([512, 64, 43])
torch.Size([512, 64, 43])
torch.Size([512, 64, 43])
torch.Size([512, 128, 20])
torch.Size([512, 128, 20])
torch.Size([512, 128, 20])
torch.Size([512, 256, 8])
torch.Size([512, 256, 8])
torch.Size([512, 256, 8])
torch.Size([512, 8, 256])
torch.Size([512, 512])
torch.Size([512, 512])
torch.Size([512, 256]) # -> I took this model up to this point 
torch.Size([512, 8]) 

So I am assuming that I should take the model from the point where the time dimension is still there, right? Something like the sketch below (a dummy tensor stands in for the real activation):
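
import torch

feat_a = torch.randn(512, 256, 8)  # [batch, num of features, num of frames]
feat_a = feat_a.permute(0, 2, 1)   # [512, 8, 256] = [batch, seq_len, features] for the LSTM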

But since the other model takes videos as input and the shapes look a little different… can I still simply concatenate them?

Here is also my model B, with input data of shape [batch size, num of channels, num of frames, height, width],

while the last few rows of the model summary (without the fc layer) are:

        ResStage-193       [-1, 2048, 22, 8, 8]               0
       AvgPool3d-194       [-1, 2048, 15, 2, 2]               0
         Dropout-195       [-1, 2048, 15, 2, 2]               0 #[batch, num of features, num of frames, height, width]
 ResNetBasicHead-196               [-1, 122880]               0
             Net-197               [-1, 122880]               0

So when I retain the time dimension, I have the following outputs from the two models:

[128, 256, 8]          # [batch, num of features, num of frames]
[128, 2048, 15, 2, 2]  # [batch, num of features, num of frames, height, width]

How can I concatenate them?
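
My current guess would be something like the following (dummy tensors; averaging over height/width and interpolating the 15 frames down to 8 are just assumptions to make the shapes match), but I am not sure this is the right approach:

import torch
import torch.nn.functional as F

feat_a = torch.randn(128, 256, 8)          # [batch, features, frames]
feat_b = torch.randn(128, 2048, 15, 2, 2)  # [batch, features, frames, height, width]

feat_b = feat_b.mean(dim=[3, 4])        # average over height and width -> [128, 2048, 15]
feat_b = F.interpolate(feat_b, size=8)  # resample 15 frames to 8       -> [128, 2048, 8]

x = torch.cat((feat_a, feat_b), dim=1)  # [128, 2304, 8]
x = x.permute(0, 2, 1)                  # [128, 8, 2304] = [batch, seq_len, input_size]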