Is it possible to build a model structure like this to get the final result?
Notation: N is batch_size, h is lstm hidden_size
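The output of the CNN would be of shape [N, C, W, H].
Then flatten the spatial dimensions and move the channels last, e.g. with .flatten(2).transpose(1, 2), to get [N, W * H, C]. (A plain reshape straight to [N, W * H, C] gives the right shape but mixes channel and spatial values.)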
The output of the LSTM at the last time step (with batch_first=True, keeping the time dimension) would be [N, 1, h].
On this LSTM output, you can call
.repeat(1, W * H, 1),
so that the output now has shape [N, W * H, h].
Now, you can concatenate the output of CNN and LSTM at dim=2 so that the overall output shape would be: [N, W * H, C + h].
Now, you can pass it to the linear layer and do whatever you want to do.
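The steps above can be sketched in PyTorch roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name CNNLSTMFusion, the single Conv2d layer standing in for the CNN, and all layer sizes are assumptions made for the example. It also assumes the LSTM is built with batch_first=True so its output is [N, T, h].

```python
import torch
import torch.nn as nn


class CNNLSTMFusion(nn.Module):
    """Hypothetical module: fuses a CNN feature map with the last LSTM output."""

    def __init__(self, in_channels=3, cnn_channels=8, seq_features=5,
                 hidden_size=16, out_features=4):
        super().__init__()
        # Stand-in for a real CNN backbone (assumption: one conv layer).
        self.cnn = nn.Conv2d(in_channels, cnn_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(seq_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(cnn_channels + hidden_size, out_features)

    def forward(self, image, sequence):
        feat = self.cnn(image)                    # [N, C, W, H]
        _, _, width, height = feat.shape
        # Flatten the spatial dims, then move channels last.
        feat = feat.flatten(2).transpose(1, 2)    # [N, W*H, C]

        out, _ = self.lstm(sequence)              # [N, T, h]
        last = out[:, -1:, :]                     # [N, 1, h], last time step
        last = last.repeat(1, width * height, 1)  # [N, W*H, h]

        fused = torch.cat([feat, last], dim=2)    # [N, W*H, C + h]
        return self.fc(fused)                     # [N, W*H, out_features]


model = CNNLSTMFusion()
image = torch.randn(2, 3, 6, 7)      # N=2, 3 input channels, 6 x 7 spatial size
sequence = torch.randn(2, 10, 5)     # N=2, 10 time steps, 5 features per step
output = model(image, sequence)
print(output.shape)                  # torch.Size([2, 42, 4])
```

Since the repeated LSTM vector is identical at every spatial position, last.expand(-1, width * height, -1) would give the same result without copying memory before the concatenation.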
Thank you very much. This is the first time I am combining these two models, so I appreciate your suggestions.