Hi,

I am trying to train a VQ-VAE, using this implementation.

As for the input, I have tried several formats: either a segment of the raw audio, or the audio converted to a 2D matrix with min-max normalization applied. I have also tried different loss functions (MSE and L1), and even a UNet, but none of these got the model to reconstruct the original audio. This is how I load the data:

```
import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.data = files

    def __getitem__(self, i):
        tmp = self.data[i]
        x = np.load('/aud_images/' + tmp)
        # Additive Gaussian noise (mean 0.3, std 0.01) for a denoising setup
        noise = np.random.normal(0.3, 0.01, (256, 256))
        x_n = x + noise
        # Add a channel dimension: (H, W) -> (1, H, W)
        x = np.reshape(x, (1, x.shape[0], x.shape[1]))
        x = torch.from_numpy(x).float()
        x_n = np.reshape(x_n, (1, x_n.shape[0], x_n.shape[1]))
        x_n = torch.from_numpy(x_n).float()
        return x_n, x

    def __len__(self):
        return len(self.data)
```
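For reference on the loss terms: the standard VQ-VAE objective combines a reconstruction term (MSE or L1) with the codebook and commitment losses from the quantizer. Below is a minimal sketch of one training step; `ToyVQVAE` is a hypothetical stand-in (not the implementation I linked), just to show how the losses are summed:

```python
import torch
import torch.nn as nn

class ToyVQVAE(nn.Module):
    """Hypothetical stand-in: encode, quantize, decode a (1, 256, 256) input."""
    def __init__(self, num_embeddings=64, embedding_dim=8):
        super().__init__()
        self.encoder = nn.Conv2d(1, embedding_dim, 4, stride=4)
        self.codebook = nn.Embedding(num_embeddings, embedding_dim)
        self.decoder = nn.ConvTranspose2d(embedding_dim, 1, 4, stride=4)

    def forward(self, x):
        z_e = self.encoder(x)                          # (B, D, H', W')
        b, d, h, w = z_e.shape
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, d)  # (B*H'*W', D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).reshape(b, h, w, d).permute(0, 3, 1, 2)
        # Codebook loss pulls embeddings to encoder outputs;
        # commitment loss (beta = 0.25) does the reverse
        vq_loss = ((z_q - z_e.detach()) ** 2).mean() \
                  + 0.25 * ((z_e - z_q.detach()) ** 2).mean()
        z_q = z_e + (z_q - z_e).detach()  # straight-through gradient estimator
        return self.decoder(z_q), vq_loss

model = ToyVQVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(2, 1, 256, 256)  # stand-in batch of normalized spectrograms
recon, vq_loss = model(x)
loss = nn.functional.mse_loss(recon, x) + vq_loss  # total objective
loss.backward()
opt.step()
```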

What should I do? Almost everywhere I read that MSE loss works fine for audio data. Let me know.