I tried VQVAE on audio. The decoder outputs the raw waveform, which I convert to a log-mel spectrogram before computing the loss (I add 1e-7 before taking the log so that its argument never reaches 0).
This worked fine with a normal VAE.
However, after I switched to VQVAE, NaN started to appear at the step that converts the spectrogram to a mel spectrogram, somewhere between epoch 0.1 and epoch 3 of training (I did not train past the 3rd epoch).
I concluded this because I checked for inf, -inf, and NaN at each stage with the assertions below, and the check that always failed was the one right after transform.mel_scale(spec).
Also, when I compute the loss on the plain spectrogram instead of the mel spectrogram, the accuracy is not good, but training itself works.
assert float("inf") not in spec
assert float("-inf") not in spec
assert not torch.isnan(spec).any()
assert not torch.isnan(mel).any()
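
For reference, the same checks could be folded into one helper that tests NaN, +inf, and -inf together, so that inf in the mel output (which the assertions above do not test) would also be caught. This is only a sketch: `check_finite` is a hypothetical name, not something from PyTorch, and the random tensors below are stand-ins for the real spectrogram and mel tensors.

```python
import torch


def check_finite(name: str, x: torch.Tensor) -> None:
    """Raise a descriptive error if x contains NaN, +inf, or -inf."""
    if torch.isfinite(x).all():
        return
    n_nan = int(torch.isnan(x).sum())
    n_pos_inf = int(torch.isposinf(x).sum())
    n_neg_inf = int(torch.isneginf(x).sum())
    raise FloatingPointError(
        f"{name}: {n_nan} NaN, {n_pos_inf} +inf, {n_neg_inf} -inf "
        f"out of {x.numel()} values"
    )


# Stand-in tensors (assumption): in the real pipeline these would be the
# spectrogram and the output of transform.mel_scale(spec).
spec = torch.rand(1, 513, 100)
mel = torch.rand(1, 80, 100)
log_mel = torch.log(mel + 1e-7)  # same epsilon as in the post

# Check every stage in order; the first failing stage raises.
for name, tensor in [("spec", spec), ("mel", mel), ("log_mel", log_mel)]:
    check_finite(name, tensor)
```

Calling the helper after each stage reports which stage first produced a non-finite value and how many of each kind there were, which gives a bit more to go on than a bare assertion failure.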