I'm currently working on a sound classification task. What I've done is preprocess the audio, convert it to mel spectrograms, and feed those into my CNN module.
I heard that I can give the audio more dimensions so that I can extract more meaningful features, and that it would work better.
But how can I get, say, 2-dimensional features? I don't even know what kind of features I have right now, so it really confuses me.
Based on your code it seems you are creating mel spectrograms via a transformation, which would be the input features to your model.
Could you describe this idea in more detail?
I would guess your training might benefit from “more information” in the input signal, but just increasing the input dimensions without adding information sounds a bit wasteful to me.
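To illustrate the point: a (mel) spectrogram is already a 2-dimensional feature (frequency bins × time frames), so you may already have what you're asking about. Here is a minimal numpy sketch (the function name, FFT size, and hop length are my own choices, not from your code) that computes a magnitude spectrogram and then shows one common way to add *information* rather than just dimensions, by stacking a delta (first difference over time) as a second channel, the way a CNN would take a 2-channel image:

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1)).T

# Hypothetical example input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # already 2-D: (frequency bins, time frames) -> (257, 122)

# Stack the first-order time difference ("delta") as a second channel,
# giving a (channels, freq, time) array -- extra information, not padding.
delta = np.diff(spec, axis=1, prepend=spec[:, :1])
features = np.stack([spec, delta])
print(features.shape)  # (2, 257, 122)
```

In a real pipeline you would use something like `torchaudio.transforms.MelSpectrogram` instead of this hand-rolled STFT, but the shapes behave the same way: the transform output is 2-D per clip, and stacked derived features become channels.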