Feature extraction from log-mel spectrograms using CNNs

I am currently working on an ASR-related project in which I would like to combine a convolutional neural network (CNN) with a GRU network and the CTC loss function.
The idea is to use the CNN to extract representative features from log-mel spectrograms and then pass the resulting feature maps as input to the GRU network, after first reducing the maps to a single channel, i.e. a tensor of shape (Batch Size x C = 1 x H x W).
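To make the pipeline concrete, here is a minimal PyTorch sketch of what I have in mind. It is only an illustration under my own assumptions: the class name `SpecCNNGRU`, the layer widths, the hidden size and the 29 output classes (index 0 used as the CTC blank) are hypothetical choices, not taken from any paper. The convolutions use padding and no pooling, so H and W are preserved, and a 1x1 convolution reduces the maps to C = 1 before the GRU.

```python
import torch
import torch.nn as nn


class SpecCNNGRU(nn.Module):
    """CNN feature extractor -> bidirectional GRU -> per-frame log-probabilities."""

    def __init__(self, n_mels=23, n_classes=29, hidden=128):
        super().__init__()
        # 3x3 convolutions with padding=1 keep both H (mel bins) and W (time).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),  # reduce the feature maps to C = 1
        )
        self.gru = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (B, 1, H, W) batch of padded log-mel spectrograms
        f = self.cnn(x)                     # (B, 1, H, W)
        f = f.squeeze(1).transpose(1, 2)    # (B, W, H): one H-dim vector per frame
        out, _ = self.gru(f)                # (B, W, 2 * hidden)
        return self.fc(out).log_softmax(dim=-1)  # (B, W, n_classes)


model = SpecCNNGRU()
spec = torch.randn(4, 1, 23, 600)           # random stand-in for real spectrograms
log_probs = model(spec)                     # (4, 600, 29)

# Random stand-in targets; labels start at 1 because 0 is the blank.
targets = torch.randint(1, 29, (4, 50))
# Time is not downsampled, so the unpadded widths can be used directly.
input_lengths = torch.full((4,), 600, dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)

# nn.CTCLoss expects log-probabilities shaped (T, B, n_classes).
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

Because the time axis is never pooled, each GRU step still corresponds to exactly one spectrogram frame, which keeps the CTC input lengths simple.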

I have read publications on this topic, but I have not found any information on whether such a solution makes sense. I know that log-mel spectrograms already contain features extracted from the audio files, but I would like to make those features more meaningful.

I was planning to use the feature extraction module from the VGG-16 network, but my spectrograms have dimensions of 1x23xW, where W ranges from 600 to 1500 (I pad each spectrogram to the maximum width in its batch). Because of the discrepancy between these dimensions and the image dimensions the module expects, I cannot use it to fine-tune the model and feed its output to the GRU.
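One likely source of the problem is the pooling arithmetic. VGG-16's convolutional part contains five 2x2 max-pooling layers with stride 2 (its 3x3 convolutions use padding and keep the size), and it also expects 3 input channels rather than 1. The small helper below, whose name `pooled_sizes` is my own, traces one axis through those five pools:

```python
def pooled_sizes(size, n_pools=5, kernel=2, stride=2):
    """Sizes along one axis after each of n_pools unpadded max-pool layers."""
    sizes = []
    for _ in range(n_pools):
        size = (size - kernel) // stride + 1
        sizes.append(size)
    return sizes


print(pooled_sizes(23))   # mel axis:  [11, 5, 2, 1, 0]
print(pooled_sizes(600))  # time axis: [300, 150, 75, 37, 18]
```

With only 23 mel bins, the frequency axis shrinks to nothing at the fifth pool, and the time axis is shortened by a factor of about 32, which would also have to be accounted for in the CTC input lengths.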

Can you tell me what kind of CNN to use to extract features from spectrograms of such dimensions? I would also appreciate knowing if it makes sense to use CNNs in such a case.

Have a look at the wav2vec 2.0 paper by Facebook AI. They do not use mel spectrograms but apply convolutions directly to the raw audio data; still, the idea of combining a convolutional network for feature extraction with a transformer- or recurrent-type network for processing those features is used there as well.

As far as I remember, wav2vec 2.0 applies 1D convolutional layers along the time axis of the raw audio data. The result is a matrix whose rows correspond to the features extracted by the CNN and whose columns represent time steps. The general idea is the same, but I would like to apply a CNN to log-mel spectrograms and then pass its output to a GRU network. The main question is: will using a CNN produce a better feature representation from the spectrograms? And what information, in terms of speech representation, can I extract by applying convolutional layers?
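For reference, the number of time steps that such a 1D convolutional front end produces can be worked out from its kernel widths and strides. The sketch below uses the values reported in the wav2vec 2.0 paper for its feature encoder (seven layers, kernel widths 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) and assumes 16 kHz audio; the function name `n_frames` is only for illustration.

```python
# Kernel widths and strides of the wav2vec 2.0 feature encoder (from the paper).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)


def n_frames(n_samples):
    """Number of output time steps for a raw waveform of n_samples (no padding)."""
    length = n_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


print(n_frames(16000))  # one second of 16 kHz audio -> 49 time steps
```

So the convolutional stack emits roughly one feature vector every 20 ms, which is the same kind of frame sequence a recurrent or transformer network consumes when it is fed spectrogram frames.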