Is there any difference in training on the spectrogram of a waveform as compared to training on the waveform itself? Does training on the waveform generate better results in general?
I think the difference would be quite large, since the sampling rate in the time domain can be high (e.g. 16 kHz means 16,000 samples per second), which makes training directly on raw waveforms challenging. If I remember correctly, one way to make models work on waveforms directly was to use a stack of conv layers with increasing dilation, so that the receptive field covered a feasible amount of input in the end.
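The idea above can be sketched roughly like this (my illustration, loosely WaveNet-style, not code from any specific model): 1-D convolutions whose dilation doubles at every layer, so the receptive field grows exponentially with depth while the parameter count grows only linearly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedStack(nn.Module):
    def __init__(self, channels=16, kernel_size=2, num_layers=8):
        super().__init__()
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList()
        in_ch = 1
        for i in range(num_layers):
            # dilation doubles at every layer: 1, 2, 4, ..., 128
            self.convs.append(
                nn.Conv1d(in_ch, channels, kernel_size, dilation=2 ** i))
            in_ch = channels

    def forward(self, x):  # x: (batch, 1, samples)
        for i, conv in enumerate(self.convs):
            # left-pad so each conv is causal and the length is preserved
            pad = (2 ** i) * (self.kernel_size - 1)
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x

model = DilatedStack()
x = torch.randn(1, 1, 16000)  # 1 s of raw audio at 16 kHz (dummy data)
y = model(x)                  # length is preserved: (1, 16, 16000)
# receptive field = 1 + sum over layers of dilation * (kernel_size - 1)
receptive_field = 1 + sum((2 - 1) * 2 ** i for i in range(8))
```

With only 8 layers and kernel size 2, the last output sample already "sees" 256 input samples, which is why these stacks make raw-waveform input tractable.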
When computing the spectrogram of the waveform, is there any difference between … and torch.stft? Also, is there an equivalent of …?
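For reference, a bare torch.stft call looks like this (a minimal sketch with dummy data; as far as I know, torchaudio.transforms.Spectrogram computes its output via torch.stft internally, so the magnitudes should agree for matching n_fft, hop_length, window, and power settings):

```python
import torch

waveform = torch.randn(16000)   # 1 s of audio at 16 kHz (dummy data)
n_fft, hop = 512, 256
window = torch.hann_window(n_fft)

# complex STFT; return_complex=True is required on recent PyTorch versions
stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=window, return_complex=True)
spectrogram = stft.abs() ** 2   # power spectrogram
# shape: (n_fft // 2 + 1 frequency bins, number of frames)
```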
@ptrblck - Would you happen to know any audio models that use this technique?
Also, is there any advantage to training on time-domain data using dilated convolutions over training on a time-frequency representation such as the STFT?
I was thinking about WaveNet in my previous post.
This paper is a bit older by now, so I guess there are new insights today about the advantages and shortcomings of time- vs. frequency-domain approaches.
@Mole_Turner A lot of the more recent models I’ve read about don’t resort to using a time-frequency (T-F) representation of the input wave signal (e.g. an STFT representation). Also, I read in Supervised Speech Separation Based on Deep Learning: An Overview that end-to-end speech separation methods like temporal mapping (which don’t require resorting to a T-F representation) have the following advantage:
A potential advantage of this approach is to circumvent the need to use the phase of noisy speech in reconstructing enhanced speech, which can be a drag for speech quality, particularly when input SNR is low.
As a convolution operator is the same as a filter or a feature extractor, CNNs appear to be a natural choice for temporal mapping.
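To make the "temporal mapping" idea concrete, here is a minimal sketch (my own illustration, not code from the paper): a small 1-D CNN that maps a noisy waveform directly to an enhanced waveform, so the noisy phase never has to be reused for reconstruction.

```python
import torch
import torch.nn as nn

class TemporalMapper(nn.Module):
    """Waveform-in, waveform-out enhancement network (toy example)."""
    def __init__(self, channels=32, kernel_size=15):
        super().__init__()
        pad = kernel_size // 2  # "same" padding keeps the length fixed
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size, padding=pad),
        )

    def forward(self, noisy):   # noisy: (batch, 1, samples)
        return self.net(noisy)  # enhanced waveform, same shape

model = TemporalMapper()
noisy = torch.randn(4, 1, 16000)  # batch of four 1 s clips (dummy data)
enhanced = model(noisy)
```

Training such a model would simply regress `enhanced` against the clean waveform (e.g. with an L1 or SI-SDR loss), with no spectrogram or phase handling anywhere in the pipeline.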