Choosing an initial model architecture

How do people decide which model architecture to use when their problem isn’t well documented and so doesn’t really have an established state of the art (SOTA)?

Even if I think “well, okay, let’s try a CNN”, I’m struggling to find documentation (which might mean I haven’t looked hard enough) that outlines how to choose initial hyperparameters, how many layers to use, which kinds of layers, etc.

For reference, I’m looking to take a stereo audio signal as input and transform it into a multichannel audio signal. Intuition plus some of the literature suggests I should be looking at CRNNs (convolutional recurrent neural networks), but I’m struggling with where to start building one, and with whether that’s even what I need.
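
To make that concrete, this is roughly the kind of starting point I imagine, sketched in PyTorch. Everything in it is an assumption on my part rather than something taken from the literature: the name TinyCRNN, the 6 output channels (for example a 5.1 layout), the layer sizes, and the choice to work on the raw waveform instead of a spectrogram.

```python
import torch
from torch import nn


class TinyCRNN(nn.Module):
    """Hypothetical baseline: stereo waveform in, multichannel waveform out."""

    def __init__(self, in_channels=2, out_channels=6, hidden=32):
        super().__init__()
        # Convolutional front end: extracts local features.
        # padding=4 with kernel_size=9 keeps the time length unchanged.
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        # Recurrent part: models longer-range context along the time axis.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Per-time-step projection to the assumed number of output channels.
        self.head = nn.Linear(2 * hidden, out_channels)

    def forward(self, x):
        # x: (batch, in_channels, time)
        h = self.conv(x)            # (batch, hidden, time)
        h = h.transpose(1, 2)       # (batch, time, hidden)
        h, _ = self.rnn(h)          # (batch, time, 2 * hidden)
        y = self.head(h)            # (batch, time, out_channels)
        return y.transpose(1, 2)    # (batch, out_channels, time)


model = TinyCRNN()
stereo = torch.randn(4, 2, 1024)    # random stand-in for 4 stereo clips
multichannel = model(stereo)
n_params = sum(p.numel() for p in model.parameters())
print(multichannel.shape, n_params)
```

This only checks that the shapes line up (2 channels in, 6 channels out, same number of samples); it says nothing about whether the sizes are sensible, which is exactly the part I don’t know how to choose.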

When you do choose a model, at what point do you decide “yup, this isn’t working, I need to try something different”?