Pre-calculating RNN input size to avoid hardcoding


I’ve got a CRNN that I’m using on an audio task, and I want to set it up so I can change the size of the matrix being fed into the CNN (different resolution spectrograms etc.) without changing any of the code in the model. However, the input size of the RNN depends on the size of the CNN output. Is there a way I can pre-calculate this to avoid having to hardcode it?

If the CNN output is [batch, channels, frames, freq_bins], I would first swap the axes so the order is [batch, frames, channels, freq_bins] and then reshape to [batch, frames, channels*freq_bins], which would then be fed into the RNN.

This is all fine, but the sizes of the frames and freq_bins dimensions depend on the pooling layers, and I’m not sure of a formula that gives the required RNN input size from the initial size of the CNN input and the known pooling layers. I’ve put some code below as an example:

Conv1 = nn.Conv2d(in_channels=1, out_channels=Conv1_filts, kernel_size=(3, 3), padding='same')
Conv2 = nn.Conv2d(in_channels=Conv1_filts, out_channels=Conv2_filts, kernel_size=(3, 3), padding='same')
batch_norm = nn.BatchNorm2d(num_features=64)
pool1 = nn.MaxPool2d(kernel_size=(5, 4))
pool2 = nn.MaxPool2d(kernel_size=(1, 4))
pool3 = nn.MaxPool2d(kernel_size=(1, 2))
gru1 = nn.GRU(batch_first=True, input_size=Conv2_filts*(???), hidden_size=128, num_layers=1, bidirectional=True)

input = torch.zeros((256, 1, 512, 64)) # batch, channels, frames, freq_bins

x = Conv1(input)
x = batch_norm(x)
x = nn.functional.relu(x)
x = pool1(x)

x = Conv2(x)
x = batch_norm(x)
x = nn.functional.relu(x)
x = pool2(x)

x = Conv2(x)
x = batch_norm(x)
x = pool3(x)

spec_cnn = x.permute((0, 2, 1, 3)) # reshape from [batch, channels, frames, bins] --> [batch, frames, channels, bins]

rnn_in = torch.reshape(spec_cnn, (batch_size, frames???, -1))

Not sure if this is allowed - but I’m bumping this.

It seems that in Keras one does not need to provide the input size to many of the pre-defined layers. This really helps to speed up experimentation with architectures and input features.

Is there no way to neatly do this in PyTorch?

You could either manually calculate the activation shape, print it in the forward and use this printed shape as the input feature dimension for the needed module, or use the nn.Lazy* modules which will set their feature dimensions based on an input.
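For the manual calculation, here is a minimal sketch in plain Python. It assumes the nn.MaxPool2d defaults used in the question (stride equal to kernel_size, no padding, dilation 1, ceil_mode=False) and that the padding='same' convolutions leave frames and freq_bins unchanged. The helper names pool_out and cnn_out_shape are made up for illustration, and conv2_filts = 64 is an assumption taken from the BatchNorm2d(num_features=64) line.

```python
def pool_out(size, kernel, stride=None, padding=0, dilation=1):
    """Output length along one axis of a max-pool layer (floor mode)."""
    stride = kernel if stride is None else stride  # MaxPool2d default: stride = kernel_size
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def cnn_out_shape(frames, bins, pool_kernels):
    """Apply each (frames_kernel, bins_kernel) pooling step in order."""
    for k_frames, k_bins in pool_kernels:
        frames = pool_out(frames, k_frames)
        bins = pool_out(bins, k_bins)
    return frames, bins


pools = [(5, 4), (1, 4), (1, 2)]          # pool1, pool2, pool3 from the question
frames_out, bins_out = cnn_out_shape(512, 64, pools)

conv2_filts = 64                          # assumed number of channels leaving the CNN
rnn_input_size = conv2_filts * bins_out   # channels * freq_bins after pooling

print(frames_out, bins_out, rnn_input_size)
```

With a 512 x 64 input this gives 102 frames and 2 frequency bins, so the GRU would need input_size = Conv2_filts * 2. If any pooling layer uses a non-default stride, padding, or ceil_mode, the helper would need to be adjusted accordingly.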

Are there any drawbacks to using nn.Lazy* modules? I had considered using the first option, but if I’m initialising a module in the forward, won’t that reset the weights each time it’s called?

def forward(self, x):
    y = nn.Conv2d(x)

    init_rnn(input_shape = y.shape)

    out = rnn(y)
    return out

Yes, reinitializing the module is the wrong approach and you should register it only once. Take a look at this post as well as the linked docs to check how to use the lazy module approach properly.
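As a rough illustration of registering a lazy module once, the sketch below creates an nn.LazyLinear in __init__ and lets the first forward call infer its in_features. The class name LazyHead, the 128 output features, and the dummy tensor shape are assumptions for the example, not part of the original model.

```python
import torch
from torch import nn


class LazyHead(nn.Module):
    def __init__(self, out_features=128):
        super().__init__()
        # Registered once here; in_features is inferred on the first forward call.
        self.proj = nn.LazyLinear(out_features)

    def forward(self, x):
        # [batch, channels, frames, bins] -> [batch, frames, channels * bins]
        x = x.permute(0, 2, 1, 3).flatten(start_dim=2)
        return self.proj(x)


model = LazyHead()
dummy = torch.zeros(2, 64, 102, 2)  # assumed CNN output shape

with torch.no_grad():
    out = model(dummy)  # first call materializes the weights

weight_before = model.proj.weight.clone()
with torch.no_grad():
    model(dummy)  # later calls reuse the same weights
same = torch.equal(weight_before, model.proj.weight)

print(out.shape, same)
```

Because the projection output has a fixed size, a following nn.GRU could be given input_size=out_features regardless of the spectrogram resolution. Running one dummy batch before creating the optimizer is a common way to make sure the lazy parameters exist.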

Am I right in looking at the docs that there isn’t a nn.LazyRNN/GRU?

Yes, they don’t seem to be implemented (yet). Feel free to create a feature request on GitHub (and in case you are interested in implementing it, also mention that in the request).
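Until a lazy GRU exists, one possible workaround is to run a dummy tensor through the CNN once inside __init__ and size the GRU from the result. The sketch below is only an assumption about how the model could be organised: the class name CRNN, the input_shape argument, and the use of separate conv and batch-norm layers per block are illustrative choices rather than the original code.

```python
import torch
from torch import nn


class CRNN(nn.Module):
    def __init__(self, input_shape=(1, 512, 64), conv1_filts=64, conv2_filts=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(input_shape[0], conv1_filts, kernel_size=3, padding="same"),
            nn.BatchNorm2d(conv1_filts),
            nn.ReLU(),
            nn.MaxPool2d((5, 4)),
            nn.Conv2d(conv1_filts, conv2_filts, kernel_size=3, padding="same"),
            nn.BatchNorm2d(conv2_filts),
            nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(conv2_filts, conv2_filts, kernel_size=3, padding="same"),
            nn.BatchNorm2d(conv2_filts),
            nn.MaxPool2d((1, 2)),
        )
        # Probe the CNN once with a dummy input to read off the output shape.
        # eval() keeps the batch-norm running statistics untouched.
        self.cnn.eval()
        with torch.no_grad():
            _, channels, _, bins = self.cnn(torch.zeros(1, *input_shape)).shape
        self.cnn.train()

        self.gru = nn.GRU(input_size=channels * bins, hidden_size=128,
                          num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, x):
        x = self.cnn(x)                                 # [batch, channels, frames, bins]
        x = x.permute(0, 2, 1, 3).flatten(start_dim=2)  # [batch, frames, channels * bins]
        out, _ = self.gru(x)
        return out


model = CRNN()
out = model(torch.zeros(2, 1, 512, 64))
print(model.gru.input_size, out.shape)
```

Changing the spectrogram resolution then only means passing a different input_shape, for example CRNN(input_shape=(1, 256, 128)), and the GRU is created only once, so its weights are not reset on each forward call.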