InverseMelScale

Mukesh1729 · June 23, 2021, 10:09am

If I get the error below, what should my input tensor to the inverse mel scale function look like?

inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1025)

inv_mel(s)

TypeError Traceback (most recent call last)

in ()
2 plt.imshow(s)
3 inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1025)
----> 4 inv_mel(s)

1 frames

/usr/local/lib/python3.7/dist-packages/torchaudio/transforms.py in forward(self, melspec)
388 """
389 # pack batch
→ 390 shape = melspec.size()
391 melspec = melspec.view(-1, shape[-2], shape[-1])
392

TypeError: 'int' object is not callable

ptrblck · June 24, 2021, 3:18am

Based on the error message, it seems that you are passing a np.array instead of a tensor. On an array, .size is an int attribute rather than a method, so calling it fails:

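import numpy as np
import torch

x = np.random.randn(10)
x.size()  # ndarray.size is an int attribute, so calling it fails
> TypeError: 'int' object is not callable

x = torch.randn(10)
x.size()  # Tensor.size is a method
> torch.Size([10])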

so you might want to use tensors.

Mukesh1729 · June 24, 2021, 8:10am

Hey Patrick, thanks for getting back. When I use a tensor, I get the following error. I am a bit confused about what the input should look like, given a mel spectrogram of shape torch.Size([288, 432, 4]):

AssertionError Traceback (most recent call last)

in ()
7
8 inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1023)
----> 9 inv_mel(sample)

1 frames

/usr/local/lib/python3.7/dist-packages/torchaudio/transforms.py in forward(self, melspec)
394 freq, _ = self.fb.size() # (freq, n_mels)
395 melspec = melspec.transpose(-1, -2)
→ 396 assert self.n_mels == n_mels
397
398 specgram = torch.rand(melspec.size()[0], time, freq, requires_grad=True,

AssertionError:

ptrblck · June 24, 2021, 4:53pm

I guess the input shape might be wrong.
From the docs:

n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
…
melspec (Tensor) – A Mel frequency spectrogram of dimension (…, n_mels, time)

Based on this, it seems the n_mels argument defaults to 128, while your input tensor has a size of 432 in this dimension.

Mukesh1729 · June 25, 2021, 11:43am

Hey Patrick, my input tensor has shape torch.Size([288, 432, 4]). What do you think the input shape should be in this case?

Why are there three dots in "A Mel frequency spectrogram of dimension (…, n_mels, time)"?
Do they imply something?

Thanks

ptrblck · June 25, 2021, 11:44pm

The three dots indicate any additional leading dimensions, while the last two dimensions represent n_mels and time.
Could you set n_mels to 432 in InverseMelScale? It should then work:

inv_mel = torchaudio.transforms.InverseMelScale(n_stft=1023, n_mels=432)
x = torch.randn(288, 432, 4)
out = inv_mel(x)
print(out.shape)
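
To make the (…, n_mels, time) convention concrete, here is a minimal pure-Python sketch of the shape handling visible in the traceback above: the leading dimensions are collapsed into one batch dimension via `view(-1, shape[-2], shape[-1])`, and the second-to-last dimension is compared against `n_mels`. The helper name `pack_mel_shape` is hypothetical and is not part of torchaudio. It only mirrors the shape logic, not the transform itself.

```python
from math import prod


def pack_mel_shape(shape, n_mels):
    """Collapse the leading dims of a (..., n_mels, time) shape into one batch dim.

    Hypothetical helper for illustration only; it imitates the shape check,
    not the actual inverse mel computation.
    """
    if len(shape) < 2:
        raise ValueError("expected at least 2 dimensions: (..., n_mels, time)")
    *leading, mels, time = shape
    if mels != n_mels:
        raise ValueError(f"second-to-last dim is {mels}, but n_mels={n_mels}")
    # prod of an empty list is 1, so a plain (n_mels, time) input gets batch size 1.
    return (prod(leading), mels, time)


print(pack_mel_shape((288, 432, 4), n_mels=432))    # (288, 432, 4)
print(pack_mel_shape((2, 3, 128, 50), n_mels=128))  # (6, 128, 50)
print(pack_mel_shape((128, 50), n_mels=128))        # (1, 128, 50)

try:
    # The default n_mels=128 does not match a size of 432 in that dimension.
    pack_mel_shape((288, 432, 4), n_mels=128)
except ValueError as err:
    print("mismatch:", err)
```

With a shape of (288, 432, 4), the tensor is therefore read as 288 batch items, 432 mel bins, and 4 time frames.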

Mukesh1729 · June 27, 2021, 9:53am

Hey Patrick,
I had actually already tried the step you mentioned, setting n_mels=432, and my cell (Jupyter notebook) ran for a very long time with no output.

ptrblck · June 27, 2021, 8:32pm

I think a long runtime is expected, since InverseMelScale uses SGD to solve the mapping.
From the docs:

Solve for a normal STFT from a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
It minimizes the euclidian norm between the input mel-spectrogram and the product between the estimated spectrogram and the filter banks using SGD.

Corresponding lines of code.

You could change the tolerance to stop the optimization earlier, if needed.
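
To illustrate why the tolerance controls the runtime, here is a rough pure-Python sketch of an iterative solver with early stopping. It is not torchaudio's implementation: the function `invert_mel`, the parameter names `lr`, `max_iter`, `tolerance_loss`, and `tolerance_change`, and the tiny 3x2 filter bank are all assumptions made for this example. Check the signature of InverseMelScale in your installed torchaudio version for the options it actually exposes.

```python
def invert_mel(mel, fb, lr=0.1, max_iter=100_000,
               tolerance_loss=1e-5, tolerance_change=1e-8):
    """Estimate spec (length n_freq) such that spec @ fb is close to mel.

    fb is an n_freq x n_mels filter bank given as a list of rows.
    Returns the estimate and the number of iterations actually run.
    """
    n_freq, n_mels = len(fb), len(fb[0])
    spec = [0.0] * n_freq
    prev_loss = float("inf")
    it = 0
    for it in range(1, max_iter + 1):
        # Residual between the projected estimate and the target mel frame.
        resid = [sum(spec[f] * fb[f][m] for f in range(n_freq)) - mel[m]
                 for m in range(n_mels)]
        loss = sum(r * r for r in resid) / n_mels
        # Stop early once the loss is small enough or has stopped improving.
        if loss < tolerance_loss or abs(prev_loss - loss) < tolerance_change:
            break
        prev_loss = loss
        # Gradient step, clamped at zero because magnitudes are non-negative.
        for f in range(n_freq):
            grad = 2.0 / n_mels * sum(resid[m] * fb[f][m] for m in range(n_mels))
            spec[f] = max(0.0, spec[f] - lr * grad)
    return spec, it


# Toy data: 3 frequency bins mapped onto 2 mel bins.
fb = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
mel = [2.0, 4.0]

_, n_tight = invert_mel(mel, fb)                       # default tolerance
_, n_loose = invert_mel(mel, fb, tolerance_loss=1e-2)  # looser tolerance
print("iterations (tight, loose):", n_tight, n_loose)
```

A looser `tolerance_loss` ends the loop after fewer iterations, at the cost of a less accurate estimate. A lower `max_iter` caps the runtime in the same way.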

Mukesh1729 · June 28, 2021, 8:18am

Hey Patrick, thanks for getting back. I will try it out again.