CNN RNN Hybrid - sequences of variable length

Hi all,

I am currently trying to train a network for a regression task: basically, I need to map a variable-length sequence (say, a spectrogram) to a fixed-size vector. My baseline model was a many-to-one GRU that mapped the sequence to a vector, which was then fed to a Linear layer for the output. This didn’t go very well, which led me to believe that perhaps I should shrink the spectrogram before handing it over to the GRU. This actually helped a lot.
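For context, the baseline could look something like this (a minimal sketch; the feature size, hidden size, and output dimension here are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

class BaselineGRU(nn.Module):
    """Many-to-one GRU: variable-length sequence -> fixed-size vector."""

    def __init__(self, n_feats=100, hidden=128, output_dim=8):
        super().__init__()
        self.gru = nn.GRU(input_size=n_feats, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, output_dim)

    def forward(self, x):
        # x: (batch, time, n_feats)
        _, h_n = self.gru(x)       # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])  # use the last hidden state -> (batch, output_dim)

model = BaselineGRU()
out = model(torch.randn(4, 300, 100))
print(out.shape)  # torch.Size([4, 8])
```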

My problem is that with the baseline GRU model I just used PackedSequence, because I knew the original lengths of the sequences (which are needed to create a PackedSequence). But now, after the sequences first pass through a conv2d layer, I “lose” that information and no longer know the lengths of the sequences. So my question is: what is the common practice in such cases? I am open to other architectures as well, btw.
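For reference, the PackedSequence workflow with known lengths looks roughly like this (a sketch with made-up sizes):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Sequences zero-padded to a common length; true lengths kept alongside
# so the GRU skips the padded frames.
batch = torch.zeros(3, 300, 100)          # (batch, max_time, features)
lengths = torch.tensor([300, 230, 150])   # true lengths per example

packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
gru = torch.nn.GRU(input_size=100, hidden_size=64, batch_first=True)
packed_out, h_n = gru(packed)

# Unpack back to a padded tensor; the true lengths come back with it.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths)  # torch.Size([3, 300, 64]) tensor([300, 230, 150])
```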


Maybe I’m misunderstanding something, but why would passing a sequence through a convolutional layer make you “lose” information about the length of that sequence? For a given series of convolutions, you can quite easily figure out the output shape from the input shape.
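Concretely, each conv/pool layer transforms lengths by the standard formula `L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1`, which you can apply to each sequence's true length just as the layer applies it to the padded one (a sketch; the layer parameters below are illustrative):

```python
def conv_out_len(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a conv/pool layer along one dimension."""
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Example: a true length of 230 through conv(k=3, s=1, p=1) then pool(k=2, s=2)
l = conv_out_len(230, kernel_size=3, stride=1, padding=1)  # 230 (shape-preserving conv)
l = conv_out_len(l, kernel_size=2, stride=2)               # 115 (halved by pooling)
print(l)  # 115
```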

So let’s say I have sequences of different lengths. I need to zero-pad them all to the same length to work in batches, so for each sequence I “remember” its length. For example, I have a sequence tensor of 300 x 100, and I “remember” that its actual length is 230, meaning there are 70 zero-padded time frames. Now, after I pass that through conv+pool it shrinks to a new size, and the old length of 230 is no longer relevant. Did I manage to explain myself? I feel like maybe this is a bit tricky, or I’m missing something.
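One way to handle the bookkeeping described above (a sketch with made-up layer sizes, not the original model): apply the same length arithmetic the conv stack performs to each stored true length, so the shrunken lengths can still be used to build a PackedSequence for the GRU.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

def shrink(lengths, kernel_size, stride, padding=0):
    # elementwise floor((L + 2p - k) / s) + 1, the conv output-length formula
    return (lengths + 2 * padding - kernel_size) // stride + 1

conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.zeros(4, 1, 300, 100)              # (batch, 1, time, freq), zero-padded
lengths = torch.tensor([300, 230, 150, 80])  # true lengths along the time axis

y = pool(conv(x))                            # (4, 8, 150, 50)
new_lengths = shrink(shrink(lengths, 3, 1, padding=1), 2, 2)  # [150, 115, 75, 40]

# Fold channels x freq into a feature dim; time stays the sequence axis.
seq = y.permute(0, 2, 1, 3).flatten(2)       # (4, 150, 8 * 50)
packed = pack_padded_sequence(seq, new_lengths, batch_first=True, enforce_sorted=False)
```

Note that a “same”-padded conv (k=3, s=1, p=1) leaves the lengths unchanged; it is the pooling that halves them, so the updated length 115 replaces the original 230.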

Hey @Felix_Kreuk,
Did you manage to find a solution to this? I am in the exact same position now.