I am currently working on an Automatic Speech Recognition project. So far I have prepared log-mel spectrograms, which I intend to pass to a neural network. The network will consist of a CNN block acting as a feature extractor, followed by an LSTM that turns the extracted features into text.
The problem is that each spectrogram has a different width (the height is fixed). I know that for an LSTM (or RNNs in general) I can use the pad_sequence function to pad the variable-width spectrograms to a common width. I can then pass the padded batch, together with the original lengths, to pack_padded_sequence, which packs it into a PackedSequence so that the LSTM skips the padded time steps. I also know that pack_padded_sequence should be applied before the LSTM block to handle variable-width input sequences.
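To make my understanding of these two functions concrete, here is a minimal sketch on dummy data. The sizes and the (time, n_mels) layout of each spectrogram are assumptions made only for illustration:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

n_mels = 80  # assumed fixed spectrogram height

# Three dummy spectrograms with different widths, stored as (time, n_mels)
# so that pad_sequence pads along the time axis.
specs = [torch.randn(t, n_mels) for t in (120, 95, 60)]
lengths = torch.tensor([s.shape[0] for s in specs])

# Zero-pad every spectrogram to the longest one in the batch.
padded = pad_sequence(specs, batch_first=True)  # (batch, max_time, n_mels)

# Pack the padded batch; the lengths tell PyTorch which frames are real.
packed = pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False
)

# Unpacking restores the padded layout and the original lengths.
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(padded.shape, packed.data.shape, out_lengths)
```

The packed data holds only the real frames (120 + 95 + 60 of them), which is why I expect it to be the right input format for the LSTM.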
However, I wonder whether I also need to use pack_padded_sequence before the CNN block. As far as I know, a CNN operates on ordinary dense tensors, so I assume I can simply pad the spectrograms to the same width with pad_sequence and pass the padded tensor to the CNN block. Is that the correct approach? Here is a snippet of my code:
    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

    class MyModel(nn.Module):
        def __init__(self, input_size, hidden_size, num_layers, output_size):
            super(MyModel, self).__init__()
            self.cnn = CNNBlock()
            self.avg_pool = nn.AdaptiveAvgPool1d(1)
            self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
            self.fc = nn.Linear(hidden_size, output_size)

        def forward(self, x, lengths):
            x = pad_sequence(x, batch_first=True)
            x = self.cnn(x)  # apply CNN to the padded sequences
            x = self.avg_pool(x)
            x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)  # pack the padded sequences
            x, _ = self.lstm(x)  # apply LSTM to the packed sequences
            x, _ = pad_packed_sequence(x, batch_first=True)
            x = self.fc(x)
            return x
Can you please tell me whether this solution is correct? Or should I also apply pack_padded_sequence before the CNN block, i.e. pad_sequence, then pack_padded_sequence, then the CNN block, and only then pass the result to the LSTM? Which of the two approaches is correct?
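For reference, this is roughly how I picture the first approach (pad, then CNN, then pack, then LSTM) at the tensor level. It is only a sketch under assumptions: the single Conv2d layer is a hypothetical stand-in for my real CNNBlock, chosen so that the time axis keeps its length, and all sizes are made up. If the real CNN used striding or pooling along time, I assume the lengths would have to be rescaled accordingly before packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

n_mels, hidden_size = 80, 32  # hypothetical sizes

# Dummy batch of variable-width spectrograms, each stored as (time, n_mels).
specs = [torch.randn(t, n_mels) for t in (120, 95, 60)]
lengths = torch.tensor([s.shape[0] for s in specs])

# Hypothetical stand-in for CNNBlock: kernel 3, padding 1, stride 1,
# so the time and frequency axes keep their sizes.
cnn = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
lstm = nn.LSTM(input_size=4 * n_mels, hidden_size=hidden_size, batch_first=True)

padded = pad_sequence(specs, batch_first=True)  # (B, T, n_mels)
feats = cnn(padded.unsqueeze(1))                # (B, C, T, n_mels)

# Merge channels and frequency bins into one feature vector per time step.
b, c, t, f = feats.shape
seq = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (B, T, C * n_mels)

# Pack only after the CNN, using the original (unchanged) lengths.
packed = pack_padded_sequence(seq, lengths, batch_first=True, enforce_sorted=False)
out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(out, batch_first=True)  # (B, T, hidden_size)
print(feats.shape, seq.shape, out.shape, out_lengths)
```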