I am a beginner who has jumped into the deep end and am trying to learn things that way. I have tried to find answers in the official resources, but I feel that if you're not trying to classify MNIST, the tutorials are less useful (and I already got my first toy network to work). So, after staring at the screen for a few hours now, I hope I can ask my (hopefully) simple questions here…

Thanks for reading on

Mission:

I want to try to implement a simple version of this network (because it’s relevant for what I eventually want to do):

My network:

My input is a series of spectrograms, so its size is [N, 1, 129, 29]

I wish to first filter that through a conv2d layer with 20 filters:

self.conv1 = nn.Conv2d(1,20,(5, 5), stride=(1, 1), padding=(2, 2))

The output is [N, 20, 129, 29]

Then the point is to take each of the 29 ‘columns’ in my output and pass them into a bidirectional GRU layer, which will treat them as a sequence of 29 vectors and iterate over them. But how do I actually do that? I assume I have to somehow split the conv2d output into a ‘minibatch’ with 29 elements, and then pass that to a GRU layer with input size 129?
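To make the question concrete, here is a minimal sketch of one way this step could look. It is not taken from the paper. It assumes the 20 channels and 129 frequency bins are flattened together, so each of the 29 columns becomes one vector of 20 * 129 = 2580 features (rather than 129). The batch size of 4 and the hidden size of 64 are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

N = 4  # hypothetical batch size
x = torch.randn(N, 1, 129, 29)  # [batch, channel, frequency, time]

conv1 = nn.Conv2d(1, 20, (5, 5), stride=(1, 1), padding=(2, 2))
feat = conv1(x)  # [N, 20, 129, 29]

# Bring the time axis (the 29 columns) next to the batch axis,
# then flatten channels x frequency into one feature vector per column.
seq = feat.permute(0, 3, 1, 2)       # [N, 29, 20, 129]
seq = seq.reshape(N, 29, 20 * 129)   # [N, 29, 2580]

# hidden_size=64 is an assumption, not a value from the paper.
gru = nn.GRU(input_size=20 * 129, hidden_size=64,
             batch_first=True, bidirectional=True)
out, h_n = gru(seq)

print(out.shape)  # torch.Size([4, 29, 128])
print(h_n.shape)  # torch.Size([2, 4, 64])
```

With batch_first=True the GRU reads the 29 columns as time steps for every item in the batch, so no manual splitting into a ‘minibatch’ of 29 elements is needed. In the output, out[..., :64] holds the forward-direction hidden states and out[..., 64:] the backward-direction ones.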

In the paper, they write:

For convenience, we interpret the image X after the filterbank layers as a sequence of T feature vectors X ≡ (x_1, x_2, ..., x_T), where each x_t, 1 ≤ t ≤ T, is the image column at time index t. We then aim to read the sequence of feature vectors into a single feature vector using the attention-based bidirectional RNN. The forward and backward recurrent layers of the RNN iterate over individual feature vectors of the sequence in opposite directions and compute forward and backward sequences of hidden state vectors H^f = (h^f_1, h^f_2, ..., h^f_T) and H^b = (h^b_1, h^b_2, ..., h^b_T),