Understanding channel dim in audio processing

Hi!

I’m working on a generative audio model, and I want to know what the most common way of choosing the channel dimension is. I’ve seen that some people, after computing torch.stft or a spectrogram, add an extra singleton dimension to get (batch, 1, freq_bins, length) and treat it as a single channel. But I feel it’s better to use freq_bins as the channel dimension, in a more Transformer-like style: (batch, length, freq_bins) → (batch, length, latent_channels).
Any thoughts on this?
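To make the comparison concrete, here is a minimal sketch of the two layouts (the batch/bin/frame sizes are made up for illustration):

```python
import torch

# Toy spectrogram magnitudes: (batch, freq_bins, frames), e.g. from torch.stft
batch, freq_bins, frames = 4, 257, 100
spec = torch.randn(batch, freq_bins, frames)

# Layout A: add a singleton channel dim for 2D convs
spec_2d = spec.unsqueeze(1)      # -> (batch, 1, freq_bins, frames)

# Layout B: treat freq_bins as the feature/channel dim, Transformer-style
spec_seq = spec.transpose(1, 2)  # -> (batch, frames, freq_bins)

print(spec_2d.shape)   # torch.Size([4, 1, 257, 100])
print(spec_seq.shape)  # torch.Size([4, 100, 257])
```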

Thank you,

Hello,
It is definitely recommended to use the freq_bins dimension as channels. All you have to do is change Conv2D operations to Conv1D operations and you should be good to go! It is more elegant and faster, and it reduces the number of parameters compared to 2D operations on the extra singleton dimension.
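A minimal sketch of the shape mechanics behind that swap (sizes are made up; the 2D route keeps a frequency axis in its output, while the 1D route folds frequency into the channels):

```python
import torch
import torch.nn as nn

batch, freq_bins, frames, latent = 4, 257, 100, 64
spec = torch.randn(batch, freq_bins, frames)  # e.g. magnitudes from torch.stft

# 2D route: singleton channel, kernel slides over freq AND time,
# so the output keeps a frequency axis: (batch, latent, freq_bins, frames)
conv2d = nn.Conv2d(1, latent, kernel_size=3, padding=1)
out2d = conv2d(spec.unsqueeze(1))

# 1D route: freq_bins are the input channels, kernel slides over time only,
# collapsing frequency into the latent channels: (batch, latent, frames)
conv1d = nn.Conv1d(freq_bins, latent, kernel_size=3, padding=1)
out1d = conv1d(spec)

print(out2d.shape)  # torch.Size([4, 64, 257, 100])
print(out1d.shape)  # torch.Size([4, 64, 100])
```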

hope this helps!
