VQVAE doesn't learn on audio data


I am trying to train a VQVAE. I used this implementation.

As for the input, I tried several formats: either taking a segment of the raw audio, or converting the audio into a 2D matrix and applying min-max normalization. I also tried different loss functions (MSE and L1) and even a UNet, but the model is not able to reconstruct the original audio. This is how I load the data:

import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.data = files

    def __getitem__(self, i):
        tmp = self.data[i]
        # Each file is a pre-computed 256x256 matrix derived from one audio clip
        x = np.load('/aud_images/' + tmp)
        # Additive Gaussian noise (mean 0.3, std 0.01) to build the noisy input
        noise = np.random.normal(0.3, 0.01, (256, 256))
        x_n = x + noise
        # Add a channel dimension -> (1, H, W) and convert to float tensors
        x = torch.from_numpy(np.reshape(x, (1, x.shape[0], x.shape[1]))).float()
        x_n = torch.from_numpy(np.reshape(x_n, (1, x_n.shape[0], x_n.shape[1]))).float()
        # (noisy input, clean target)
        return x_n, x

    def __len__(self):
        return len(self.data)
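For reference, a minimal sanity check of this dataset looks like this (the file listing and batch size here are just placeholders, not my exact setup):

import os
import torch

files = os.listdir('/aud_images/')                      # placeholder listing
loader = torch.utils.data.DataLoader(data_gen(files), batch_size=16, shuffle=True)

x_noisy, x_clean = next(iter(loader))
print(x_noisy.shape, x_clean.shape)                     # torch.Size([16, 1, 256, 256]) each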

What should I do? Almost everywhere I read that MSE loss works fine for audio data. Let me know.

As I work with speech synthesis: from what I have seen, people generally have a hard time reconstructing raw waveforms, and I have not seen many people try it, because you miss a lot of the frequency-domain information that is useful for capturing the properties of audio. I think a good idea would be to first turn the audio into a more audio-friendly representation such as a Spectrogram, MFCCs, or a Mel-Spectrogram if it is speech.

More information on these can be found in torchaudio.transforms.Spectrogram and torchaudio.transforms.MFCC.
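For example, something like this (the file name, sample rate, and FFT/hop sizes are just placeholders):

import torchaudio

waveform, sr = torchaudio.load("speech.wav")  # placeholder path

# Linear-frequency power spectrogram
spec = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256)(waveform)

# Mel-spectrogram (common for speech)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                           hop_length=256, n_mels=80)(waveform)

# MFCCs
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=40)(waveform)

print(spec.shape, mel.shape, mfcc.shape)  # (channels, bins, frames) each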

Then you can use any vocoder to reconstruct the raw waveform from it.
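If you don't want to set up a neural vocoder right away, Griffin-Lim (also in torchaudio) is a quick, lower-quality way to get a waveform back from a linear power spectrogram; the parameters here are only illustrative and must match the forward transform:

import torchaudio

griffin_lim = torchaudio.transforms.GriffinLim(n_fft=1024, hop_length=256)
# `spec` is a power spectrogram produced with the same n_fft / hop_length
waveform_rec = griffin_lim(spec)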

I hope it helps :slight_smile:

But when reconstructing audio from something like a spectrogram, I face the problem of output size. My audio clips have different lengths, so the input and the output (which would be a fully connected layer) will have different sizes. How do I deal with that?

Also, why is it so hard to reconstruct raw audio data?

I think you should look into how variable-length speech is generated: look into Tacotron and related ideas; there is also VAE-Tail, which gives an idea of how to do it with just a VAE.
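This is not from Tacotron, but as a stop-gap while you experiment: if the model is fully convolutional (no flatten + Linear at the end), the output length follows the input length, and you only need to pad variable-length spectrograms to a common number of frames within each batch. A rough sketch with made-up shapes:

import torch
import torch.nn.functional as F

def collate_pad(batch):
    # batch: list of (noisy, clean) pairs shaped (1, n_bins, frames),
    # where `frames` differs from clip to clip
    max_frames = max(x.shape[-1] for x, _ in batch)
    noisy = torch.stack([F.pad(x, (0, max_frames - x.shape[-1])) for x, _ in batch])
    clean = torch.stack([F.pad(y, (0, max_frames - y.shape[-1])) for _, y in batch])
    return noisy, clean

# loader = torch.utils.data.DataLoader(dataset, batch_size=16, collate_fn=collate_pad)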

Also, why is it so hard to reconstruct raw audio data?

Audio carries a lot of information in the frequency domain. These features are not captured directly by the raw waveform, so you need better feature representations for it.


So then how is Jukebox able to work so well? I don’t think it uses any frequency-domain data for audio reconstruction.

I am not familiar with it, but my high-level understanding from what I see is that they do use some other feature representation for reconstruction; for example, they used Transformers for text representations and conditioned the synthesis on that as well.

So this is a good explanation of Jukebox, and you can see in the architecture that they are not using any spectral information to recreate the original signal.


I don’t know! If it is still of interest to you, another major challenge I forgot to mention, and a reason people often avoid synthesising waveforms directly, is the sampling rate of raw audio. It is generally >= 16000 Hz, which means that for one second of audio you need to synthesise 16000 data points; unless this is done by some parallel process, that is a lot of points to synthesise. Therefore people often synthesise spectrograms and then transform them into waveforms with the help of a vocoder.
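To put rough numbers on that (16 kHz audio and an 80-bin mel-spectrogram with a hop of 256 samples; both values are just assumptions):

sr, seconds, hop, n_mels = 16000, 10, 256, 80

raw_samples = sr * seconds            # 160,000 values to generate sample by sample
frames = raw_samples // hop + 1       # ~626 spectrogram frames
spec_values = frames * n_mels         # ~50,080 values, but only ~626 sequential steps

print(raw_samples, frames, spec_values)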