VQVAE doesn't learn on audio data


I am trying to train a VQVAE. I used this implementation.

As for the input, I tried several formats: either taking a segment of the raw audio, or converting the audio into a 2D matrix and applying min-max normalization. I also tried different loss functions (MSE and L1) and even a UNet, but the model is not able to reconstruct the original audio. This is how I load the data:

import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.data = files

    def __getitem__(self, i):
        tmp = self.data[i]
        # Each file is a pre-computed 256x256 matrix derived from one audio clip
        x = np.load('/aud_images/' + tmp)
        # Additive Gaussian noise (mean 0.3, std 0.01) to build the noisy input
        noise = np.random.normal(0.3, 0.01, (256, 256))
        x_n = x + noise
        # Add a channel dimension -> (1, H, W) and convert to float tensors
        x = torch.from_numpy(np.reshape(x, (1, x.shape[0], x.shape[1]))).float()
        x_n = torch.from_numpy(np.reshape(x_n, (1, x_n.shape[0], x_n.shape[1]))).float()
        # (noisy input, clean target)
        return x_n, x

    def __len__(self):
        return len(self.data)
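For reference, a minimal sanity check of this dataset looks like this (the file listing and batch size here are just placeholders, not my exact setup):

import os
import torch

files = os.listdir('/aud_images/')                      # placeholder listing
loader = torch.utils.data.DataLoader(data_gen(files), batch_size=16, shuffle=True)

x_noisy, x_clean = next(iter(loader))
print(x_noisy.shape, x_clean.shape)                     # torch.Size([16, 1, 256, 256]) each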

What should I do? Almost everywhere I read that MSE loss works fine for audio data. Let me know.

As I work with speech synthesis: from what I have seen, people generally have a hard time reconstructing raw waveforms, and I have not seen many people try it, because you miss a lot of the frequency-domain information that is useful for capturing the properties of audio. I think a good idea would be to first turn the audio into a more audio-friendly representation such as a Spectrogram, MFCCs, or a Mel-Spectrogram if it is speech.

More information on these can be found in torchaudio.transforms.Spectrogram and torchaudio.transforms.MFCC.
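For example, something like this (the file name, sample rate, and FFT/hop sizes are just placeholders):

import torchaudio

waveform, sr = torchaudio.load("speech.wav")  # placeholder path

# Linear-frequency power spectrogram
spec = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256)(waveform)

# Mel-spectrogram (common for speech)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                           hop_length=256, n_mels=80)(waveform)

# MFCCs
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=40)(waveform)

print(spec.shape, mel.shape, mfcc.shape)  # (channels, bins, frames) each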

Then you can use any vocoder to reconstruct the raw waveform from it.
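If you don't want to set up a neural vocoder right away, Griffin-Lim (also in torchaudio) is a quick, lower-quality way to get a waveform back from a linear power spectrogram; the parameters here are only illustrative and must match the forward transform:

import torchaudio

griffin_lim = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)
# `spec` is a power spectrogram produced with the same n_fft / hop_length
waveform_rec = griffin_lim(spec)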

I hope it helps :slight_smile:

But when reconstructing audio from something like a spectrogram, I face the problem of output size. My audio clips have different lengths, so the input and the output (which would be a fully connected layer) will have different sizes. How do I deal with that?

Also, why is it so hard to reconstruct raw audio data?

I think you should look into how variable-length speech is generated: look into Tacotron and related ideas; there is also VAE-Tail, which gives an idea of how to do it with just a VAE.
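This is not from Tacotron, but as a stop-gap while you experiment: if the model is fully convolutional (no flatten + Linear at the end), the output length follows the input length, and you only need to pad variable-length spectrograms to a common number of frames within each batch. A rough sketch with made-up shapes:

import torch
import torch.nn.functional as F

def collate_pad(batch):
    # batch: list of (noisy, clean) pairs shaped (1, n_bins, frames),
    # where `frames` differs from clip to clip
    max_frames = max(x.shape[-1] for x, _ in batch)
    noisy = torch.stack([F.pad(x, (0, max_frames - x.shape[-1])) for x, _ in batch])
    clean = torch.stack([F.pad(y, (0, max_frames - y.shape[-1])) for _, y in batch])
    return noisy, clean

# loader = torch.utils.data.DataLoader(dataset, batch_size=16, collate_fn=collate_pad)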

Also, why is it so hard to reconstruct raw audio data?

Audio carries a lot of information in the frequency domain. These features are not captured directly by the raw waveform, so you need better feature representations for it.


So then how is Jukebox able to work so well? I don’t think it uses any frequency-domain data for audio reconstruction.

I am not familiar with it, but my high-level understanding from what I see is that they do use some other feature representation for reconstruction; for example, they used Transformers for text representations and conditioned the synthesis on that as well.

So this is a good explanation of Jukebox, and you can see in the architecture that they are not using any spectral information to recreate the original signal.


I don’t know! If it is still of interest to you, another major challenge I forgot to mention, and a reason people often avoid synthesising waveforms directly, is the sampling rate of raw audio. It is generally >= 16000 Hz, which means that for one second of audio you need to synthesise 16000 data points; unless this is done by some parallel process, that is a lot of points to synthesise. Therefore people often synthesise spectrograms and then transform them into waveforms with the help of a vocoder.
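To put rough numbers on that (16 kHz audio and an 80-bin mel-spectrogram with a hop of 256 samples; both values are just assumptions):

sr, seconds, hop, n_mels = 16000, 10, 256, 80

raw_samples = sr * seconds            # 160,000 values to generate sample by sample
frames = raw_samples // hop + 1       # ~626 spectrogram frames
spec_values = frames * n_mels         # ~50,080 values, but only ~626 sequential steps

print(raw_samples, frames, spec_values)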