CNN with LSTM input shapes

I am trying to combine CNN and LSTM for the audio data.
Let us say the output of my CNN model is torch.Size([8, 1, 10, 10]), which is [B x C_out x Frequency x Time],
and the LSTM requires [L x B x InputSize].

My question is: what is the InputSize in the LSTM, and how shall I feed the output of the CNN to the LSTM?

Please help @ptrblck

The mentioned InputSize in your shape information would correspond to the “feature” dimension.

Since your CNN output is 4-dimensional, you would have to decide which dimensions correspond to the temporal dimension and which to the features.

Assuming you would like to use C_out and Frequency as the features, you could use:

x = torch.randn(8, 1, 10, 10)         # [batch_size, channels, frequency, time]
x = x.view(x.size(0), -1, x.size(3))  # [batch_size, features=channels*frequency, seq_len=time]
x = x.permute(2, 0, 1)                # [seq_len, batch_size, features]

and pass it to the RNN.
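
For completeness, a minimal end-to-end sketch of this (the hidden size here is just an example, not part of the original question):

import torch
import torch.nn as nn

x = torch.randn(8, 1, 10, 10)                  # [batch_size, channels, frequency, time]
x = x.view(x.size(0), -1, x.size(3))           # [8, features=1*10, seq_len=10]
x = x.permute(2, 0, 1)                         # [seq_len=10, batch_size=8, features=10]

lstm = nn.LSTM(input_size=10, hidden_size=32)  # input_size = channels * frequency
out, (h, c) = lstm(x)                          # out: [seq_len, batch_size, hidden_size]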

PS: Please don’t tag certain people, as this might discourage others from posting a solution :wink:

Thank you @ptrblck, this is what I was looking for.

Suppose that after feeding the input data to the CNN model, it outputs variable-length shapes like the examples below.

CNN Shape==========> torch.Size([2, 128, 5, 28])
CNN Shape==========> torch.Size([2, 128, 9, 28])

Now, when I feed this to the LSTM after performing the operations below,

x = x.view(x.size(0), -1, x.size(3)) # [batch_size, features=channels*height, seq_len=width]
x = x.permute(2, 0, 1) # [seq_len, batch_size, features]

it raises the error
input.size(-1) must be equal to input_size.

How is the variable length from the CNN handled before feeding it to the LSTM?

Sure, I will be careful in the future.
Actually, your explanations are very clear and to the point, and I really enjoy them.

In an RNN the temporal dimension is variable, not the feature dimension.
As a workaround you could use the channels (dim1) as the feature dimension and height*width as the temporal dimension.
However, based on your description you would like to use the width as the time dimension: [B x C_out x Frequency x Time].
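
A sketch of that workaround, assuming a made-up hidden size, could look like this:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=64)  # features = channels; hidden size assumed
for shape in [(2, 128, 5, 28), (2, 128, 9, 28)]:
    x = torch.randn(shape)                      # [batch, channels, height, width]
    x = x.view(x.size(0), x.size(1), -1)        # [batch, channels, height*width]
    x = x.permute(2, 0, 1)                      # [seq_len=height*width, batch, features=128]
    out, _ = lstm(x)                            # seq_len varies (140 vs. 252); input_size stays 128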

Sorry, you are right, the temporal dimension is variable:

CNN Shape==========> torch.Size([16, 128, 40, 21])
CNN Shape==========> torch.Size([16, 128, 40, 28])

This is the output I get from the CNN.
Now how do we handle this variable length in RNNs? (This is on-the-fly training.)

Same as before: make sure to pass the inputs to the RNN as [seq_len, batch_size, features].
If dim3 is now the time dimension and dim1 and dim2 (flattened together) are the features:

x = x.view(x.size(0), -1, x.size(3))  # [batch_size, features=channels*frequency, seq_len=time]
x = x.permute(2, 0, 1)                # [seq_len, batch_size, features]
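
With these shapes only dim3 varies, so the flattened feature size (128*40 = 5120) stays constant across batches; a sketch, with an assumed hidden size:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128 * 40, hidden_size=256)       # hidden size is an assumption
for x in [torch.randn(16, 128, 40, 21), torch.randn(16, 128, 40, 28)]:
    x = x.view(x.size(0), -1, x.size(3)).permute(2, 0, 1)  # [seq_len, 16, 5120]
    out, _ = lstm(x)                                       # seq_len of 21 vs. 28 is handled fine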

Hi ptrblck,
During permutation I had to use (0, 2, 1) to match [batch_size, seq_length, out_channels] in my case.
For video classification, my variable factor is the batch_size, so by changing the batch_size I can control the temporal part of a video. The seq_length comes from the previous block as part of the feature vector, which I can’t change. I am a little confused here, as the temporal part should be controlled by seq_length, not by the batch_size.
Please let me know. Thank you once again.

Regards,
ananda2020

While the batch size can vary, it doesn’t represent the temporal dimension, but just how many samples you are processing at once. If your seq_length is static, you are not working with a variable temporal dimension.

Make sure to permute the input to the expected dimensions. By default RNNs expect an input of [seq_len, batch_size, features]. With batch_first=True the input should be [batch_size, seq_len, features].
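
A small sketch contrasting the two layouts (all sizes here are made up):

import torch
import torch.nn as nn

x = torch.randn(4, 7, 32)                      # [batch_size=4, seq_len=7, features=32]

lstm_bf = nn.LSTM(input_size=32, hidden_size=16, batch_first=True)
out, _ = lstm_bf(x)                            # out: [batch_size, seq_len, hidden_size]

lstm = nn.LSTM(input_size=32, hidden_size=16)  # default expects [seq_len, batch, features]
out, _ = lstm(x.permute(1, 0, 2))              # out: [seq_len, batch_size, hidden_size]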

I am extracting frames from videos. Each frame is then fed into the CNN to get the features, and the output from the CNN is fed into the LSTM. So how can I change the seq_length?

Thanks in advance.
Regards,
ananda2020

I assume your CNN creates features in the shape [batch_size, features] and you would like to use the batch size as the temporal dimension, since you made sure that the ordering of the input images is appropriate for the use case.
If that’s the case, just unsqueeze a fake batch dimension in dim1 and pass the outputs to the RNN.
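
As a sketch (the feature size of 512 and the hidden size are assumptions for illustration):

import torch
import torch.nn as nn

feats = torch.randn(30, 512)                     # per-frame CNN features: [num_frames, features]
feats = feats.unsqueeze(1)                       # [seq_len=30, batch_size=1, features=512]
lstm = nn.LSTM(input_size=512, hidden_size=256)  # hidden size assumed
out, _ = lstm(feats)                             # out: [30, 1, 256]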

Thank you ptrblck. Yes, I named the frames in such a way that they are sequenced. Thank you once again.

Regards,
Alakananda

Thanks @ptrblck, it worked as required.

Hi ptrblck,
I was wrong. My dataloader was not producing sequenced data, so I added a sampler to get the sequence.
Thanks for posting the sampler code in another thread.

Regards,
ananda2020

@ptrblck I have a question regarding the hidden units of the LSTM.

self.lstm = nn.LSTM(input_size=64,
                    hidden_size=128,
                    num_layers=2)

Since the hidden units of the LSTM are fixed at 128, how does it handle variable-length inputs? It is a bit confusing.
Every time it takes a batch, the input sequence length changes, so how does it handle that?

The input_size defines the “feature dimension” of the input (and the hidden_size that of the hidden state), so both are unrelated to the temporal dimension.
This lecture on RNNs gives you a good overview of how these shapes are used.
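
As a quick sketch using your LSTM definition, the same module accepts different sequence lengths without any change (the batch size here is made up):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
for seq_len in [21, 28]:             # variable temporal dimension
    x = torch.randn(seq_len, 8, 64)  # [seq_len, batch_size, input_size=64]
    out, _ = lstm(x)                 # out: [seq_len, 8, 128]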

Thanks @ptrblck, this was very helpful.