I am currently working on an ASR-related project in which I would like to combine a convolutional neural network (CNN) with a GRU network and the CTC loss function.
The idea is to use the CNN to extract representative features from log-mel spectrograms and then pass the resulting feature maps as input to the GRU network, after first reducing the maps to shape (batch size x C = 1 x H x W).
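For concreteness, a pipeline of this kind could be sketched roughly as follows, assuming PyTorch. The class name `ConvGRUCTC`, the layer sizes, the 29-symbol vocabulary, and the random inputs are all illustrative assumptions. Note one deliberate difference from the description above: instead of reducing the feature maps to a single channel, this sketch folds the channel and frequency axes into one feature vector per time step, which is a common way to feed CNN output to a recurrent layer.

```python
import torch
import torch.nn as nn


class ConvGRUCTC(nn.Module):
    """Hypothetical CNN -> GRU model for (B, 1, 23, W) log-mel input."""

    def __init__(self, n_mels=23, n_classes=29, hidden=128):
        super().__init__()
        # Pooling halves the frequency axis only, so the time axis W is kept
        # and CTC still sees one output step per input frame.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # 23 -> 11 mel bins
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),  # 11 -> 5 mel bins
        )
        feat_dim = 64 * (n_mels // 2 // 2)     # channels * remaining mel bins
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        f = self.cnn(x)                                  # (B, C, H', W)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # (B, W, C*H')
        out, _ = self.gru(f)                             # (B, W, 2*hidden)
        return self.fc(out).log_softmax(dim=-1)          # (B, W, n_classes)


torch.manual_seed(0)
model = ConvGRUCTC()
x = torch.randn(4, 1, 23, 600)            # batch padded to width 600 (random stand-in data)
log_probs = model(x)

# Made-up labels; class 0 is reserved for the CTC blank.
targets = torch.randint(1, 29, (4, 20))
input_lengths = torch.tensor([600, 550, 500, 450])   # unpadded frame counts
target_lengths = torch.tensor([20, 18, 15, 12])
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# nn.CTCLoss expects log-probabilities shaped (T, B, n_classes).
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

In this sketch the padded frames still pass through the GRU, and only the loss ignores them via `input_lengths`. Packing the sequences before the GRU would be one way to avoid that.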
I have read publications on this topic but have not found anything on whether such an approach makes sense. I know that log-mel spectrograms are themselves features extracted from the audio files, but I would like to make the features more meaningful.
I was planning to use the feature-extraction module from the VGG-16 network, but my spectrograms have dimensions 1x23xW, where W ranges from 600 to 1500 (I pad each spectrogram to the maximum width in its batch). Because these dimensions do not match the images VGG-16 expects, I cannot fine-tune that module and feed its output to the GRU.
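The size problem can be illustrated with simple arithmetic. As far as I understand, the VGG-16 feature extractor contains five 2x2 max-pooling layers with stride 2, while its 3x3 convolutions use padding and keep the spatial size. Under that assumption, the sketch below tracks one axis through the five pools. The helper name `pooled_sizes` is made up for illustration.

```python
def pooled_sizes(size, n_pools, kernel=2, stride=2):
    """Sizes of one axis after each of n_pools unpadded max-pool layers."""
    sizes = [size]
    for _ in range(n_pools):
        # Standard floor-mode pooling formula; 0 means the axis has vanished.
        size = (size - kernel) // stride + 1 if size >= kernel else 0
        sizes.append(size)
    return sizes


heights = pooled_sizes(23, 5)    # frequency axis of a 1x23xW spectrogram
widths = pooled_sizes(600, 5)    # time axis of the shortest spectrogram

print(heights)  # [23, 11, 5, 2, 1, 0]
print(widths)   # [600, 300, 150, 75, 37, 18]
```

With only 23 mel bins the frequency axis is exhausted before the fifth pooling layer, and the time axis shrinks by roughly 32x, which leaves far fewer output steps for CTC. Separately, the pretrained VGG-16 expects 3 input channels, while these spectrograms have 1.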
Can you tell me what kind of CNN to use to extract features from spectrograms of such dimensions? I would also appreciate knowing if it makes sense to use CNNs in such a case.