Question 3:
Would it perhaps be better to create an ensemble of three different CNNs rather than try to squeeze all this differently shaped data through a single model?
Feel free to try different strategies and see which one works best for your case. My 2 cents here is that this largely depends on whether or not the three channels are related. If you use three different CNNs, they are independent, so they cannot learn the relationships between the three channels until you combine their outputs at the end. In a typical image, such as one from a camera, the three channels are highly correlated because they all represent light reflected from the same group of objects, so it is best practice to train a single CNN. But if you are restructuring audio into an image format so you can leverage a CNN, I think it can be a good idea to have three different CNNs, given that those three channels seem to be different representations of the same audio. During testing, if you use three CNNs and apply each of them to one channel, you might find that one of them alone is enough to achieve your goal. That would suggest the corresponding channel is a good representation of the audio signal for your use case.
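To make the "combine at the end, then check each CNN on its own" idea concrete, here is a minimal sketch in plain Python. It assumes each of the three CNNs has already produced class probabilities for the same test samples. The channel names, the probability values, and the helpers `accuracy` and `late_fusion` are all made up for illustration, and averaging probabilities is only one of several ways to combine the branches.

```python
from statistics import mean


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(prob_rows, labels):
    """Fraction of samples whose most probable class matches the label."""
    return mean(1.0 if argmax(p) == y else 0.0 for p, y in zip(prob_rows, labels))


def late_fusion(per_branch_probs):
    """Average the class probabilities of all branches, sample by sample."""
    fused = []
    for sample_probs in zip(*per_branch_probs):  # one row per branch
        fused.append([mean(class_probs) for class_probs in zip(*sample_probs)])
    return fused


# Hypothetical toy data: 4 test samples, 2 classes, 3 independent CNNs.
labels = [0, 1, 1, 0]
branch_outputs = {
    "channel_a": [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2]],
    "channel_b": [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.7, 0.3]],
    "channel_c": [[0.4, 0.6], [0.45, 0.55], [0.6, 0.4], [0.55, 0.45]],
}

# Score each CNN alone, then the averaged ensemble.
for name, probs in branch_outputs.items():
    print(name, accuracy(probs, labels))

ensemble_probs = late_fusion(list(branch_outputs.values()))
print("ensemble", accuracy(ensemble_probs, labels))
```

If one branch alone scores as well as the ensemble on your real test set, that is a hint the other two channels may not be adding much for your task.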