Inverse MelSpectrogram

neovand · August 23, 2020, 10:52pm

Hi, I’m trying to build an autoencoder for speech data. The network’s input and output are both Mel spectrograms. How can I recover the audio waveform from the generated Mel spectrogram?

JuanFMontesinos · August 23, 2020, 11:57pm

https://librosa.org/doc/latest/generated/librosa.feature.inverse.mel_to_audio.html?highlight=mel%20spectrogram

TPU · February 27, 2021, 3:46pm

librosa.istft() should also work, I think, once you have a linear-frequency STFT with phase (it does not take a Mel spectrogram directly).

(from Wave-U-Net utils: https://github.com/f90/Wave-U-Net/blob/master/Utils.py)

knoriy · March 5, 2021, 5:25pm

As JuanFMontesinos said, Librosa is a great and easy way of converting Mel spectrograms back to audio.

I’ve also used WaveGlow for the same task in the past; you can use the NVIDIA implementation.

TPU · March 10, 2021, 12:17am

Here’s a small example using librosa.istft from this FactorGAN implementation:

import numpy as np
import librosa

# Note: reconPhase (the Griffin-Lim phase recovery helper) is not shown here;
# see the Utils.py linked below.

def spectrogramToAudioFile(magnitude, fftWindowSize, hopSize, phaseIterations=10, phase=None, length=None):
    '''
    Computes an audio signal from the given magnitude spectrogram, and optionally an initial phase.
    Griffin-Lim is executed to recover/refine the phase from the magnitude spectrogram.
    :param magnitude: Magnitudes to be converted to audio
    :param fftWindowSize: Size of FFT window used to create magnitudes
    :param hopSize: Hop size in samples used to create magnitudes
    :param phaseIterations: Number of Griffin-Lim iterations to recover phase
    :param phase: If given, starts ISTFT with this particular phase matrix
    :param length: If given, audio signal is clipped/padded to this number of samples
    :return: Reconstructed audio signal
    '''
    if phase is not None:
        if phaseIterations > 0:
            # Refine audio given initial phase with a number of iterations
            return reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations, phase, length)
        # reconstructing the new complex matrix
        stftMatrix = magnitude * np.exp(phase * 1j) # magnitude * e^(j*phase)
        audio = librosa.istft(stftMatrix, hop_length=hopSize, length=length)
    else:
        audio = reconPhase(magnitude, fftWindowSize, hopSize, phaseIterations)
    return audio

Link: https://github.com/f90/FactorGAN/blob/ae57301195984092ee40742273e1034f3ae27e32/Utils.py