I’ve got a CRNN that I’m using on an audio task, and I want to set it up so I can change the size of the matrix being fed into the CNN (different-resolution spectrograms, etc.) without having to change any of the code in the model. However, the input size of the RNN depends on the size of the CNN output. Is there a way I can pre-calculate this to avoid having to hardcode it?
If the CNN output is [batch, channels, frames, freq_bins], I would first swap the axes so the order is [batch, frames, channels, freq_bins], and then reshape so it’s [batch, frames, channels*freq_bins]; this would then be fed into the RNN.
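In code, that step would look something like this (a minimal sketch, where x stands for the CNN output):

# x: [batch, channels, frames, freq_bins]
x = x.permute(0, 2, 1, 3)                  # -> [batch, frames, channels, freq_bins]
x = x.reshape(x.shape[0], x.shape[1], -1)  # -> [batch, frames, channels*freq_bins]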
This is all fine, but the size of the freq_bins dimension depends on the pooling layers, and I’m not sure of a formula that would let me calculate the required RNN input size from the initial size of the input to the CNN and the known pooling layers. I’ve put some code below as an example:
import torch
import torch.nn as nn

Conv1_filts = 64  # example values; num_features=64 in the BatchNorm below assumes these
Conv2_filts = 64
batch_size = 256

Conv1 = nn.Conv2d(in_channels=1, out_channels=Conv1_filts, kernel_size=(3, 3), padding='same')
Conv2 = nn.Conv2d(in_channels=Conv1_filts, out_channels=Conv2_filts, kernel_size=(3, 3), padding='same')
batch_norm = nn.BatchNorm2d(num_features=64)
pool1 = nn.MaxPool2d(kernel_size=(5, 4))
pool2 = nn.MaxPool2d(kernel_size=(1, 4))
pool3 = nn.MaxPool2d(kernel_size=(1, 2))

# (???) is the size I don't know how to pre-calculate
gru1 = nn.GRU(batch_first=True, input_size=Conv2_filts*(???), hidden_size=128, num_layers=1, bidirectional=True)

input = torch.zeros((batch_size, 1, 512, 64))  # batch, channels, frames, freq_bins

x = Conv1(input)
x = batch_norm(x)
x = nn.functional.relu(x)
x = pool1(x)

x = Conv2(x)
x = batch_norm(x)
x = nn.functional.relu(x)
x = pool2(x)

x = Conv2(x)
x = batch_norm(x)
x = pool3(x)

# reshape from [batch, channels, frames, bins] --> [batch, frames, channels, bins]
spec_cnn = x.permute((0, 2, 1, 3))
rnn_in = torch.reshape(spec_cnn, (batch_size, frames???, -1))  # frames??? is the other unknown
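The only workaround I can think of is to push a dummy tensor through the conv/pool stack once at init and read the output shape (a sketch reusing the layer names above; BatchNorm and ReLU are skipped since they don’t change the shapes):

with torch.no_grad():
    dummy = torch.zeros(1, 1, 512, 64)  # batch, channels, frames, freq_bins
    out = pool3(Conv2(pool2(Conv2(pool1(Conv1(dummy))))))

n_frames = out.shape[2]                       # 102 here: floor(512 / 5)
rnn_input_size = out.shape[1] * out.shape[3]  # channels * freq_bins = 64 * 2 = 128
gru1 = nn.GRU(batch_first=True, input_size=rnn_input_size, hidden_size=128, num_layers=1, bidirectional=True)

Since the convs use padding='same' and MaxPool2d defaults to stride = kernel_size with no padding, I’d guess the closed form is just out = floor(in / k) applied per pooling kernel along each axis (so freq_bins goes 64 -> 16 -> 4 -> 2, and frames goes 512 -> 102), but I’d like to confirm there isn’t a cleaner or more standard way to do this.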