Can image "channels" be different shapes?

Hi all

I want to construct a 3 channel “image” from audio recordings to feed into a CNN. The three channels would be:

• STFT i.e. standard spectrogram
• Mel-spectrogram
• “Fingerprint” which is a sparse matrix representing the strongest frequencies in the standard spectrogram. This is modelled after the old Shazam algorithm.

The problem is that, for the same audio, each of these three representations has a different shape.

Question 1
Can I construct a 3 channel image without resizing the channels to equal dimensions?

Question 2
If not, is there any reason to fear resizing the images to match each other? My concern is that the meaning of the data in each will be corrupted.

Question 3:
Would it perhaps be better to create an ensemble of three different CNNs rather than try to squeeze all this differently shaped data through a single model?

Thanks in advance to you audio/visual experts!

Question 3:
Would it perhaps be better to create an ensemble of three different CNNs rather than try to squeeze all this differently shaped data through a single model?

Feel free to try different strategies and see which works best for your case. My 2 cents: it largely comes down to whether the three channels are related. If you use three separate CNNs, they are independent, so they cannot learn relationships between the channels before you combine their outputs at the end.

In a typical image, such as one from a camera, the three channels are highly correlated because they all represent light reflected from the same objects, so the best practice is to train a single CNN. If, however, you are restructuring audio into an image format just so you can leverage a CNN, I think three separate CNNs can be a good idea, given that the three channels appear to be different representations of the same audio.

During testing with three CNNs, one per channel, you might find that one of them alone is enough to achieve your goal. That would suggest the corresponding channel is a good representation of the audio for your use case.
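One middle ground between a single CNN and a fully separate ensemble is a single model with one convolutional branch per representation, whose features are concatenated before the classifier. The sketch below is only an illustration, assuming PyTorch; the class names `Branch` and `ThreeBranchNet`, the layer sizes, the number of classes, and the input shapes are all made up for the example. Note that this is late fusion trained jointly, not an ensemble of independently trained models. The adaptive pooling layer is what lets each branch accept its own input shape, so no resizing is needed.

```python
import torch
import torch.nn as nn


class Branch(nn.Module):
    """Small CNN for one single-channel input of arbitrary height and width."""

    def __init__(self, features=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, features, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapses any spatial size to 1x1
            nn.Flatten(),             # -> (batch, features)
        )

    def forward(self, x):
        return self.net(x)


class ThreeBranchNet(nn.Module):
    """One branch per representation; features are concatenated at the end."""

    def __init__(self, num_classes=10, features=32):
        super().__init__()
        self.branches = nn.ModuleList([Branch(features) for _ in range(3)])
        self.head = nn.Linear(3 * features, num_classes)

    def forward(self, stft, mel, fingerprint):
        inputs = (stft, mel, fingerprint)
        feats = [branch(x) for branch, x in zip(self.branches, inputs)]
        return self.head(torch.cat(feats, dim=1))


# Hypothetical shapes: (batch, channel, frequency bins, time frames).
model = ThreeBranchNet(num_classes=10)
stft = torch.randn(4, 1, 513, 100)
mel = torch.randn(4, 1, 128, 100)
fingerprint = torch.randn(4, 1, 513, 100)

logits = model(stft, mel, fingerprint)
print(logits.shape)  # torch.Size([4, 10])
```

Dropping two of the three inputs and training a single `Branch` with its own head is an easy way to check whether one representation is already enough on its own.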

I believe the three channels will be correlated. They contain almost identical information, except that the data is represented on different scales (like logarithmic vs linear scales).

But I think I will try both ways and see what works, as you suggest.
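For the single-model route, the three arrays need a common shape before they can be stacked into one multi-channel input. Here is a minimal sketch of that step, assuming NumPy and made-up shapes (513 STFT bins, 128 Mel bands, 100 time frames); `resize_linear` is a hypothetical helper written for this example, not a library function. Linear interpolation is only one possible choice, and it will tend to smear or weaken the isolated peaks in a sparse fingerprint, so a different resampling method may suit that channel better.

```python
import numpy as np


def resize_linear(x, out_shape):
    """Resize a 2-D array to out_shape using separable linear interpolation."""
    in_h, in_w = x.shape
    out_h, out_w = out_shape
    # Positions in the input grid that each output row/column samples from.
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    # Interpolate along the frequency axis first, then along the time axis.
    tmp = np.stack(
        [np.interp(rows, np.arange(in_h), x[:, j]) for j in range(in_w)], axis=1
    )
    out = np.stack(
        [np.interp(cols, np.arange(in_w), tmp[i, :]) for i in range(out_h)], axis=0
    )
    return out


# Random stand-ins for the three representations (frequency bins x time frames).
rng = np.random.default_rng(0)
stft = rng.random((513, 100))
mel = rng.random((128, 100))
fingerprint = (rng.random((513, 100)) > 0.99).astype(float)  # sparse peaks

# Stacking the raw arrays fails because their shapes differ.
try:
    np.stack([stft, mel, fingerprint], axis=0)
    stacked_raw = True
except ValueError:
    stacked_raw = False

# Resize everything to the Mel shape, then stack into (channels, height, width).
target = mel.shape
image = np.stack(
    [resize_linear(c, target) for c in (stft, mel, fingerprint)], axis=0
)
print(image.shape)  # (3, 128, 100)
```

Resizing to the smallest shape, as done here, discards frequency resolution from the STFT and fingerprint channels; resizing to the largest shape instead would keep that detail at the cost of a bigger input.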