I have two different, already trained models for video classification: one model takes keypoints as input, and the other takes video frames as input.
I want to concatenate their features and then pass the concatenated vector to an LSTM.
I am a little confused about whether my concatenation is correct and what the input to the LSTM should look like. At the moment I have the following:
Model A has output
[128, 256] = [batch_size, num_features]
Model B has output
[128, 122880] = [batch_size, num_features]
After concatenation I have:
[128, 123136]
Since the LSTM expects input of shape
[batch_size, seq_len, input_size] (with batch_first=True), I used:
x = x.unsqueeze(1)  # [128, 1, 123136]
But then the LSTM does not really make sense, since I have no actual time dimension, does it? Is this approach correct?
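For reference, here is a minimal, runnable sketch of what I am doing. The two model outputs are replaced by random stand-in tensors of the same shapes, and hidden_size=16 is just an arbitrary choice for the example:

```python
import torch
import torch.nn as nn

batch_size = 128

# Stand-ins for the outputs of my two pretrained models
feat_a = torch.randn(batch_size, 256)     # Model A: keypoint features
feat_b = torch.randn(batch_size, 122880)  # Model B: frame features

# Concatenate along the feature dimension
x = torch.cat([feat_a, feat_b], dim=1)    # [128, 123136]

# LSTM with batch_first=True expects [batch_size, seq_len, input_size],
# so I add a "time" dimension of length 1
x = x.unsqueeze(1)                        # [128, 1, 123136]

lstm = nn.LSTM(input_size=123136, hidden_size=16, batch_first=True)
out, (h, c) = lstm(x)
print(out.shape)                          # torch.Size([128, 1, 16])
```

With seq_len=1 the LSTM only ever sees a single time step per sample, which is what makes me doubt the approach.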