Inverse MelSpectrogram

Hi, I’m trying to build an autoencoder for speech data. The network’s input and output are both mel spectrograms. How can I recover the audio waveform from a generated mel spectrogram?

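librosa.istft() should also work, I think, once you have a linear-frequency STFT with phase. A mel spectrogram has to be mapped back to linear frequency first, and the phase has to be estimated.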

(from Wave-U-Net utils: Wave-U-Net/ at master · f90/Wave-U-Net · GitHub)

As JuanFMontesinos said, librosa is a great and easy way of converting mel spectrograms back to audio.

I’ve also used WaveGlow in the past for the same task; you can use the NVIDIA implementation.

Here’s a small example using librosa.istft from this FactorGAN implementation:

import librosa
import numpy as np

def spectrogramToAudioFile(magnitude, fftWindowSize, hopSize, phaseIterations=10, phase=None, length=None):
    """
    Computes an audio signal from the given magnitude spectrogram, and optionally an initial phase.
    Griffin-Lim is executed to recover/refine the phase from the magnitude spectrogram.
    :param magnitude: Magnitudes to be converted to audio
    :param fftWindowSize: Size of FFT window used to create magnitudes
    :param hopSize: Hop size in samples used to create magnitudes
    :param phaseIterations: Number of Griffin-Lim iterations to recover phase
    :param phase: If given, starts ISTFT with this particular phase matrix
    :param length: If given, audio signal is clipped/padded to this number of samples
    :return: Reconstructed audio signal
    """
    if phase is not None:
        if phaseIterations > 0:
            # Refine audio given initial phase with a number of iterations
            return reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations, phase, length)
        # No refinement requested: rebuild the complex STFT from magnitude and phase
        stftMatrix = magnitude * np.exp(phase * 1j) # magnitude * e^(j*phase)
        audio = librosa.istft(stftMatrix, hop_length=hopSize, length=length)
    else:
        # No phase given: estimate it from scratch with Griffin-Lim.
        # reconPhase is a helper defined elsewhere in the FactorGAN repository.
        audio = reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations)
    return audio

Link: FactorGAN/ at ae57301195984092ee40742273e1034f3ae27e32 · f90/FactorGAN · GitHub