I am trying to train a VQ-VAE. I used this implementation.
For the input, I experimented with several formats: taking a segment of the raw audio, and converting the audio into a 2D matrix with min-max normalization. I also tried different loss functions (MSE and L1) and even a U-Net, but none of these let the model reconstruct the original audio. This is how I load the data:
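For reference, here is a minimal sketch of the audio-to-matrix preprocessing described above (the function name, the 256×256 size, and the epsilon are my assumptions, not from the original code):

```python
import numpy as np

def audio_to_matrix(audio, size=256):
    """Reshape a 1-D waveform into a (size, size) matrix and min-max
    normalize it into [0, 1]. Illustrative guess at the preprocessing;
    the epsilon guards against a constant (zero-range) clip."""
    clip = audio[: size * size]          # take the first size*size samples
    mat = clip.reshape(size, size)
    lo, hi = mat.min(), mat.max()
    return (mat - lo) / (hi - lo + 1e-8)

wave = np.random.randn(300_000).astype(np.float32)
mat = audio_to_matrix(wave)
print(mat.shape)  # (256, 256), values in [0, 1]
```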
```python
import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.data = files

    def __getitem__(self, i):
        tmp = self.data[i]
        x = np.load('/aud_images/' + tmp)
        # Gaussian noise for the noisy/clean training pair
        noise = np.random.normal(0.3, 0.01, (256, 256))
        x_n = x + noise
        # add a channel dimension: (256, 256) -> (1, 256, 256);
        # note np.reshape(x, (1, x.shape, x.shape)) raises a TypeError,
        # since the shape tuple must contain ints, not tuples
        x = np.reshape(x, (1, x.shape[0], x.shape[1]))
        x = torch.from_numpy(x).float()
        x_n = np.reshape(x_n, (1, x_n.shape[0], x_n.shape[1]))
        x_n = torch.from_numpy(x_n).float()
        return x_n, x

    def __len__(self):
        return len(self.data)
```
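A quick shape sanity check can rule out data-pipeline issues before blaming the loss. The sketch below uses an in-memory stand-in for the file-based dataset (the class name and item count are illustrative) and verifies that batches come out as `(batch, 1, 256, 256)`:

```python
import numpy as np
import torch

class ToyDataset(torch.utils.data.Dataset):
    """In-memory stand-in for the file-based dataset above, used only
    to sanity-check tensor shapes before training."""
    def __init__(self, n_items=4):
        self.data = [np.random.rand(256, 256).astype(np.float32)
                     for _ in range(n_items)]

    def __getitem__(self, i):
        x = self.data[i]
        noise = np.random.normal(0.3, 0.01, (256, 256))
        x_n = x + noise
        # channel dimension first, as the conv layers expect
        x = torch.from_numpy(np.reshape(x, (1, 256, 256))).float()
        x_n = torch.from_numpy(np.reshape(x_n, (1, 256, 256))).float()
        return x_n, x

    def __len__(self):
        return len(self.data)

loader = torch.utils.data.DataLoader(ToyDataset(), batch_size=2)
x_n, x = next(iter(loader))
print(x_n.shape, x.shape)  # both torch.Size([2, 1, 256, 256])
```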
What should I do? Almost everywhere I read that MSE loss works fine for audio data, yet it isn't working here. Let me know.