Preparing dataset for CNN with LSTM

I am currently working on a project related to Automatic Speech Recognition. So far I have prepared log-mel spectrograms, which I intend to pass to a neural network. I plan to implement a network consisting of a feature extractor (a CNN block), followed by an LSTM network, which will enable text extraction from the spectrograms.

The problem is that each spectrogram has a different width (the height of the spectrogram is fixed). I know that for an LSTM (or an RNN in general) I can use the pad_sequence function to pad the variable-width spectrograms so that they all have the same width. Then I can pass those padded tensors to pack_padded_sequence, which packs the padded sequences into a single PackedSequence object. I also know that I should call pack_padded_sequence before the LSTM block so that it handles the variable-width input sequences correctly.
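To make the padding/packing step concrete, here is a minimal sketch with two toy "spectrograms" of fixed height (4 mel bins, for illustration) and different widths; pad_sequence expects each tensor shaped (time, features):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Two toy spectrograms: (width, n_mels) with n_mels fixed, width variable.
a = torch.randn(10, 4)  # 10 time steps
b = torch.randn(6, 4)   # 6 time steps
lengths = torch.tensor([10, 6])  # original widths, must be on CPU

# Pad to the widest sample: b gets 4 rows of zeros appended.
padded = pad_sequence([a, b], batch_first=True)
print(padded.shape)       # torch.Size([2, 10, 4])

# Pack: keeps only the real (non-pad) time steps, flattened.
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
print(packed.data.shape)  # torch.Size([16, 4]) -- 10 + 6 real steps
```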

However, I wonder whether I also need to use pack_padded_sequence before the CNN block. As far as I know, a CNN operates on fixed-size input tensors, so I assume I can simply pad the spectrograms to the same width using pad_sequence and then pass the padded tensor to the CNN block. Is that approach correct? Here is a snippet of my code:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

class MyModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.cnn = CNNBlock()  # assumed to output (batch, time, input_size)
        self.avg_pool = nn.AdaptiveAvgPool1d(1)
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, lengths):
        x = pad_sequence(x, batch_first=True)  # pad variable-width inputs to a common width
        x = self.cnn(x)  # apply CNN to the padded sequences
        x = self.avg_pool(x)
        # lengths must be a CPU tensor (or list) of the original, unpadded widths
        x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        x, _ = self.lstm(x)  # the LSTM skips the padded time steps
        x, _ = pad_packed_sequence(x, batch_first=True)  # back to a padded tensor
        x = self.fc(x)  # per-time-step output
        return x

Can you please tell me whether this solution is correct? I also wonder whether I should use pack_padded_sequence before the CNN block as well, i.e. pad_sequence - pack_padded_sequence - CNN block, and then pass the result to the LSTM. Which approach is correct?

Ultimately, you’re going to want to test and see what works best for your dataset and use case, but I would change the following:

  1. Use a higher-dimensional convolution kernel to avoid needing an LSTM (i.e. treat time as a dimension).
  2. Add self-attention layers. With self-attention, you can just pad with zeros where the sample size is less than the largest on that dim. Self-attention will ensure the model can highlight what information is important (i.e. padding wouldn't be).
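A minimal sketch of this idea, under stated assumptions: a hypothetical model (`ConvAttnModel`, with made-up sizes n_mels=80, d_model=128, n_classes=30) that convolves over the full frequency axis with stride 1 in time, then uses self-attention with a key_padding_mask so the padded frames are ignored, instead of packing for an LSTM:

```python
import torch
import torch.nn as nn

class ConvAttnModel(nn.Module):
    """Hypothetical sketch: 2D conv over (freq, time), then self-attention
    with a padding mask, replacing the LSTM."""
    def __init__(self, n_mels=80, d_model=128, n_heads=4, n_classes=30):
        super().__init__()
        # Kernel spans the whole frequency axis; stride 1 and padding 1 in
        # time keep the frame count unchanged, so the length mask stays valid.
        self.conv = nn.Conv2d(1, d_model, kernel_size=(n_mels, 3), padding=(0, 1))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, x, lengths):
        # x: (batch, n_mels, max_time), zero-padded along time
        h = self.conv(x.unsqueeze(1))     # (batch, d_model, 1, max_time)
        h = h.squeeze(2).transpose(1, 2)  # (batch, max_time, d_model)
        # True where a frame is padding; attention ignores those keys.
        mask = torch.arange(x.size(2))[None, :] >= lengths[:, None]
        h, _ = self.attn(h, h, h, key_padding_mask=mask)
        return self.fc(h)                 # per-frame logits

x = torch.randn(2, 80, 12)       # batch of 2 padded spectrograms
lengths = torch.tensor([12, 7])  # real widths before padding
out = ConvAttnModel()(x, lengths)
print(out.shape)  # torch.Size([2, 12, 30])
```

The padded frames still pass through the convolution, but the mask stops them from contributing to any attention output, which is what makes plain zero-padding safe here.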