CNN with LSTM input shapes

I am trying to combine a CNN and an LSTM for audio data.
Let us say the output of my CNN model is torch.Size([8, 1, 10, 10]), which is [B X C_out X Frequency X Time],
and the LSTM requires [L X B X InputSize].

My question is: what is the inputSize of the LSTM, and how should I feed the output of the CNN to the LSTM?

Please help @ptrblck

The mentioned inputSize in your shape information would correspond to the “feature” dimension.

Since your CNN output is 4-dimensional, you would have to decide which dimension corresponds to the temporal dimension and which dimensions correspond to the features.

Assuming you would like to use C_out and Frequency as the features, you could use:

import torch
x = torch.randn(8, 1, 10, 10) # [batch_size, channels, height=frequency, width=time]
x = x.view(x.size(0), -1, x.size(3)) # [batch_size, features=channels*height, seq_len=width]
x = x.permute(2, 0, 1) # [seq_len, batch_size, features]

and pass it to the RNN.
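
To make the last step concrete, here is a minimal sketch of the full path from the reshaped CNN output into an nn.LSTM. The hidden_size of 32 is an arbitrary value picked for illustration and is not from the original post; input_size is simply the flattened feature dimension (channels * frequency).

```python
import torch
import torch.nn as nn

# CNN output from the question: [batch, channels, frequency, time]
x = torch.randn(8, 1, 10, 10)
batch_size, channels, freq, time_steps = x.shape

# Merge channels and frequency into one feature axis, then move time to the front
x = x.view(batch_size, channels * freq, time_steps)  # [batch, features, seq_len]
x = x.permute(2, 0, 1)                               # [seq_len, batch, features]

# input_size must equal the feature dimension (channels * freq = 10 here);
# hidden_size=32 is an arbitrary choice for this sketch
lstm = nn.LSTM(input_size=channels * freq, hidden_size=32)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # torch.Size([10, 8, 32]) -> [seq_len, batch, hidden_size]
print(h_n.shape)  # torch.Size([1, 8, 32])  -> [num_layers, batch, hidden_size]
```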

PS: Please don’t tag specific people, as this might discourage others from posting a solution :wink:

Thank you @ptrblck, this is what I was looking for.

Suppose that, after feeding the input data to the CNN model, it outputs tensors of variable size, as in the example below.

CNN Shape==========> torch.Size([2, 128, 5, 28])
CNN Shape==========> torch.Size([2, 128, 9, 28])

Now, when I feed this to the LSTM after performing the operations below:

x = x.view(x.size(0), -1, x.size(3)) # [batch_size, features=channels*height, seq_len=width]
x = x.permute(2, 0, 1) # [seq_len, batch_size, features]

it gives the error:
input.size(-1) must be equal to input_size.

How should the variable-length output of the CNN be handled before feeding it to the LSTM?

Sure, I will be careful in the future.
Your explanations are very clear and to the point, and I really enjoy them.

In an RNN the temporal dimension is variable, not the feature dim.
In your example the height (dim2) changes from 5 to 9, so the flattened feature size channels*height changes from 640 to 1152, which raises the error.
You could use the channels (dim1) as the feature dimension and height*width as the temporal dimension as a workaround.
However, based on your description you would like to use the width as the time dim: [B X C_out X Frequency X Time].
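
A rough sketch of that workaround, using the two CNN output shapes from the question. The hidden_size of 64 is an arbitrary assumption. Note that flattening height and width into one axis interleaves frequency and time steps in row-major order, so whether the resulting sequence is meaningful depends on the use case.

```python
import torch
import torch.nn as nn

# One LSTM whose input_size equals the channel count, which is identical for both outputs;
# hidden_size=64 is an arbitrary choice for this sketch
lstm = nn.LSTM(input_size=128, hidden_size=64)

shapes = []
for height in (5, 9):
    x = torch.randn(2, 128, height, 28)  # [batch, channels, height, width]
    x = x.flatten(2)                     # [batch, channels, height * width]
    x = x.permute(2, 0, 1)               # [seq_len = height * width, batch, channels]
    out, _ = lstm(x)
    shapes.append(tuple(out.shape))

print(shapes)  # [(140, 2, 64), (252, 2, 64)]
```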

You are right,
the temporal dimension is the variable one, as in:

CNN Shape==========> torch.Size([16, 128, 40, 21])
CNN Shape==========> torch.Size([16, 128, 40, 28])

This is the output I get from the CNN.
Now, how do we handle this variable length in RNNs? (This is on-the-fly training.)

Same as before: make sure to pass the inputs to the RNN as [seq_len, batch_size, features].
If dim3 is now the time dimension and dim1 and dim2 (flattened into channels*height) are the features:

x = x.view(x.size(0), -1, x.size(3))
x = x.permute(2, 0, 1)
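
As a quick check, the sketch below pushes two batches with the time lengths from the post (21 and 28) through the same LSTM. Only the sequence length differs between the batches; the feature size stays at 128 * 40 = 5120. The hidden_size of 32 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

# Features = channels * frequency = 128 * 40; hidden_size=32 is arbitrary
lstm = nn.LSTM(input_size=128 * 40, hidden_size=32)

out_shapes = []
for time_steps in (21, 28):
    x = torch.randn(16, 128, 40, time_steps)  # [batch, channels, frequency, time]
    x = x.view(x.size(0), -1, x.size(3))      # [batch, features=5120, seq_len]
    x = x.permute(2, 0, 1)                    # [seq_len, batch, features]
    out, _ = lstm(x)                          # same module, different seq_len
    out_shapes.append(tuple(out.shape))

print(out_shapes)  # [(21, 16, 32), (28, 16, 32)]
```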

Hi ptrblck,
During the permutation I had to use (0, 2, 1) to match [batch_size, seq_length, out_channels] in my case.
For video classification, my variable factor is batch_size, so by changing batch_size I can control the temporal part of a video. seq_length comes from the previous block as part of the feature vector, which I can’t change. I am a little confused here, as the temporal part should be controlled by seq_length, not by batch_size.
Please let me know. Thank you once again.


While the batch size can vary, it doesn’t represent the temporal dimension, but just how many samples you are processing at once. If your seq_length is static, you are not working with a variable temporal dimension.

Make sure to permute the input to the expected dimensions. By default RNNs expect an input of [seq_len, batch_size, features]. With batch_first=True the input should be [batch_size, seq_len, features].
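
The two layouts can be compared side by side in a small sketch. All sizes here (batch 4, sequence length 7, 16 features, hidden size 8) are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 7, 16)  # [batch_size, seq_len, features], made-up sizes

# Default layout: the input must be [seq_len, batch_size, features]
lstm_default = nn.LSTM(input_size=16, hidden_size=8)
out_default, _ = lstm_default(x.permute(1, 0, 2))

# batch_first=True: the input stays [batch_size, seq_len, features]
lstm_bf = nn.LSTM(input_size=16, hidden_size=8, batch_first=True)
out_bf, (h_n, _) = lstm_bf(x)

print(out_default.shape)  # torch.Size([7, 4, 8])
print(out_bf.shape)       # torch.Size([4, 7, 8])
print(h_n.shape)          # torch.Size([1, 4, 8]); the hidden state is never batch-first
```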

I am extracting frames from videos. Then each frame is fed into the CNN to get the features and the output from CNN is fed into LSTM. So how can I change the seq_length?

Thanks in advance.

I assume your CNN creates features in the shape [batch_size, features] and you would like to use the batch size as the temporal dimension, since you made sure that the ordering of the input images is appropriate for the use case.
If that’s the case, just unsqueeze a fake batch dimension in dim1 and pass the outputs to the RNN.
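
A minimal sketch of that suggestion, assuming the CNN returns one 512-dimensional feature vector per frame and that the 30 frames of a single video arrive in order. Both numbers and the name frame_features are hypothetical, as is the hidden_size of 64.

```python
import torch
import torch.nn as nn

# Hypothetical per-frame CNN features: 30 ordered frames, 512 features each
frame_features = torch.randn(30, 512)  # [num_frames, features]

# Insert a batch dimension of size 1 at dim1 so the frames become the sequence
seq = frame_features.unsqueeze(1)      # [seq_len=30, batch=1, features=512]

lstm = nn.LSTM(input_size=512, hidden_size=64)  # hidden_size is arbitrary
out, (h_n, c_n) = lstm(seq)

print(out.shape)  # torch.Size([30, 1, 64])
print(h_n.shape)  # torch.Size([1, 1, 64])
```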

Thank you ptrblck. Yes, I named the frames in such a way that they are in sequence. Thank you once again.


Thanks @ptrblck, it worked as required.

Hi ptrblck,
I was wrong. My DataLoader was not returning the data in sequence, so I added a sampler to get the sequence.
Thanks for posting the sampler code in another thread.


@ptrblck I have a question regarding the hidden units of the LSTM:

self.lstm = nn.LSTM(
                    input_size = 64,
                    hidden_size = 128,
                    num_layers  = 2)

Since the number of hidden units of the LSTM is fixed at 128, how does it handle variable-length inputs? It is a bit confusing.
Every time the LSTM takes a batch, the input sequence length changes, so how does it handle that?

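The hidden size defines the “feature dimension” of the hidden state (and thus of the output) and is unrelated to the temporal dimension. The same weights are applied at every time step, so the sequence length can change freely between batches.
This lecture on RNNs gives you a good overview of how these shapes are used.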
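
A short sketch using the LSTM definition from the question illustrates this. The batch size of 3 and the sequence lengths 5 and 50 are arbitrary values chosen for the example. The weight shapes depend only on input_size and hidden_size, and the same module accepts either sequence length.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2)

# Parameter shapes depend on input_size and hidden_size only, never on seq_len
print(lstm.weight_ih_l0.shape)  # torch.Size([512, 64])  -> [4 * hidden_size, input_size]
print(lstm.weight_hh_l0.shape)  # torch.Size([512, 128]) -> [4 * hidden_size, hidden_size]

results = []
for seq_len in (5, 50):
    x = torch.randn(seq_len, 3, 64)  # [seq_len, batch, input_size]
    out, (h_n, c_n) = lstm(x)
    results.append((tuple(out.shape), tuple(h_n.shape)))

print(results)
# [((5, 3, 128), (2, 3, 128)), ((50, 3, 128), (2, 3, 128))]
```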

Thanks @ptrblck
this was very helpful