Hi, I’m trying to build an autoencoder for speech data. The network’s input and output are mel spectrograms. How can I obtain the audio waveform from the generated mel spectrogram?
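librosa.istft() should also work, I think, once you have a complex linear-frequency STFT (magnitude plus phase). It does not take a mel spectrogram directly, so the mel bands have to be mapped back to STFT magnitudes and the phase estimated first.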
(from Wave-U-Net utils: https://github.com/f90/Wave-U-Net/blob/master/Utils.py)
As JuanFMontesinos said, librosa is a great and easy way of converting mel spectrograms back to audio.
I’ve also used WaveGlow in the past for the same task; you can use the NVIDIA implementation here.
Here’s a small example using librosa.istft from this FactorGAN implementation:
import numpy as np
import librosa

# reconPhase (the Griffin-Lim loop) is defined in the linked Utils.py and is not shown here.
def spectrogramToAudioFile(magnitude, fftWindowSize, hopSize, phaseIterations=10, phase=None, length=None):
    '''
    Computes an audio signal from the given magnitude spectrogram, and optionally an initial phase.
    Griffin-Lim is executed to recover/refine the phase from the magnitude spectrogram.
    :param magnitude: Magnitudes to be converted to audio
    :param fftWindowSize: Size of FFT window used to create magnitudes
    :param hopSize: Hop size in samples used to create magnitudes
    :param phaseIterations: Number of Griffin-Lim iterations to recover phase
    :param phase: If given, starts ISTFT with this particular phase matrix
    :param length: If given, audio signal is clipped/padded to this number of samples
    :return: Reconstructed audio signal
    '''
    if phase is not None:
        if phaseIterations > 0:
            # Refine audio given initial phase with a number of iterations
            return reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations, phase, length)
        # No refinement requested: rebuild the complex STFT matrix directly
        stftMatrix = magnitude * np.exp(phase * 1j)  # magnitude * e^(j*phase)
        audio = librosa.istft(stftMatrix, hop_length=hopSize, length=length)
    else:
        # No phase given: estimate it from the magnitudes with Griffin-Lim
        audio = reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations)
    return audio
Link: https://github.com/f90/FactorGAN/blob/ae57301195984092ee40742273e1034f3ae27e32/Utils.py