How do I chop an image into a sequence of vectors? (for input into GRU layer)

I am a beginner who has jumped into the deep end and is trying to learn things that way. I have tried to find answers in the official resources, but I feel like if you’re not trying to classify MNIST, the tutorials are less useful (and I already got my first toy network to work). So, after staring at the screen for a few hours now, I hope I can ask my (hopefully) simple questions here…

Thanks for reading on :slight_smile:

Mission:
I want to try to implement a simple version of this network (because it’s relevant for what I eventually want to do):

My network:
My input is a batch of spectrograms, so its size is [N, 1, 129, 29].
I first want to filter that through a Conv2d layer with 20 filters:
self.conv1 = nn.Conv2d(1, 20, (5, 5), stride=(1, 1), padding=(2, 2))
The output is [N, 20, 129, 29].
The point is then to take each of the 29 ‘columns’ of that output and pass them to a bidirectional GRU layer, which should treat them as a sequence of 29 vectors and iterate over them. But how do I actually do that? I assume I have to somehow split the conv2d output into a ‘minibatch’ of 29 elements and then pass that to a GRU layer with input size 129?
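For reference, here is roughly where I am (a minimal sketch; the batch size is just a placeholder, and the reshaping into a sequence for the GRU is the part I can’t figure out):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, (5, 5), stride=(1, 1), padding=(2, 2))

    def forward(self, x):     # x: [N, 1, 129, 29]
        x = self.conv1(x)     # [N, 20, 129, 29]
        # How do I turn this into a sequence of 29 vectors for a bidirectional GRU?
        return x

out = Net()(torch.randn(8, 1, 129, 29))
print(out.shape)              # torch.Size([8, 20, 129, 29])
```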
In the paper, they write:

For convenience, we interpret the image X after the filterbank layers as a sequence of T feature vectors X ≡ (x_1, x_2, …, x_T), where each x_t, 1 ≤ t ≤ T, is the image column at time index t. We then aim to read the sequence of feature vectors into a single feature vector using the attention-based bidirectional RNN. The forward and backward recurrent layers of the RNN iterate over individual feature vectors of the sequence in opposite directions and compute forward and backward sequences of hidden state vectors H^f = (h^f_1, h^f_2, …, h^f_T) and H^b = (h^b_1, h^b_2, …, h^b_T).

I’m unsure where the 29 comes from. Do you mean 129 or 20 or are the “29 columns” created in a special way?
If it’s a typo and you would like to use the 20 channels as the temporal dimension, you could initialize the RNN with batch_first=True, which expects an input of shape [batch_size, seq_len, features], and flatten the conv activation via x = x.view(x.size(0), x.size(1), -1).
Alternatively, you could use the RNN’s default setup, which expects the input to have the shape [seq_len, batch_size, features]; in that case you would have to permute the tensor into the right shape after flattening the feature dimension.
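A rough sketch of both options (the hidden size, batch size, and the choice of the 20 channels as the sequence dimension are just placeholders here):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 129, 29)  # conv output: [N, 20, 129, 29]

# Option 1: batch_first=True, treating the 20 channels as the sequence
x1 = x.view(x.size(0), x.size(1), -1)           # [N, 20, 129*29]
rnn1 = nn.GRU(input_size=129 * 29, hidden_size=64,
              batch_first=True, bidirectional=True)
out1, h1 = rnn1(x1)                             # out1: [N, 20, 2*64]

# Option 2: default layout, sequence dimension first
x2 = x1.permute(1, 0, 2).contiguous()           # [20, N, 129*29]
rnn2 = nn.GRU(input_size=129 * 29, hidden_size=64, bidirectional=True)
out2, h2 = rnn2(x2)                             # out2: [20, N, 2*64]
```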


Thanks a lot :slight_smile:

The 29 comes from the last dimension of the input (which is still there after the conv2d layer), so the 29 should be the temporal dimension. I think a combination of permute and view is what I’m after.
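Something like this is what I mean (a minimal sketch; hidden size and batch size are placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 129, 29)                    # conv output: [N, 20, 129, 29]

# Move the 29 columns into the sequence dimension and flatten the rest
x = x.permute(0, 3, 1, 2)                          # [N, 29, 20, 129]
x = x.contiguous().view(x.size(0), x.size(1), -1)  # [N, 29, 20*129]

gru = nn.GRU(input_size=20 * 129, hidden_size=64,  # hidden_size is a placeholder
             batch_first=True, bidirectional=True)
out, h = gru(x)                                    # out: [N, 29, 2*64]
```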