I am using torchaudio's MelSpectrogram transform (which uses Spectrogram under the hood) together with torchaudio's Conformer model. When I feed the output of the spectrogram transform into the Conformer, it appears that I need to transpose it first. Is that right? I'd be surprised if everyone has to do that…
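
For context, here's a quick shape check (a minimal sketch; the 16 kHz sample rate and the other parameter values are just placeholders, not my real config):

import torch
import torchaudio

transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
audio = torch.randn(4, 16000)  # (batch, samples)
mel = transform(audio)
print(mel.shape)  # torch.Size([4, 80, 101]) -> (batch, n_mels, time)

Conformer.forward, on the other hand, documents its input as shape (B, T, input_dim), which is why I added the transpose. Here's my actual code: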
self.transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=config.sample_rate,
    n_fft=n_fft,
    hop_length=hop_length,
    n_mels=n_mels,  # must match config.input_dim below
)
self.conformer = torchaudio.models.Conformer(
    input_dim=config.input_dim,
    num_heads=config.num_heads,
    ffn_dim=config.ffn_dim,
    num_layers=config.num_layers,
    depthwise_conv_kernel_size=config.depthwise_conv_kernel_size,
    dropout=config.dropout,
    use_group_norm=config.use_group_norm,
    convolution_first=config.convolution_first,
)
mel_spectrogram = self.transform(audio)  # (batch, n_mels, time)
# Conformer expects (batch, time, input_dim), so swap the last two dims
mel_spectrogram = torch.transpose(mel_spectrogram, 1, 2)
log_mel = torch.log(mel_spectrogram + 1e-9)  # epsilon avoids log(0)
bs, length, _ = log_mel.shape
# every item in the batch is full length here, so lengths are all equal
lengths = torch.full((bs,), length, device=log_mel.device)
output, _ = self.conformer(log_mel, lengths)
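
For completeness, here's a self-contained version of the same pipeline that runs end to end (the hyperparameters are placeholders I picked for illustration, not my real config):

import torch
import torchaudio

transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
conformer = torchaudio.models.Conformer(
    input_dim=80,  # matches n_mels
    num_heads=4,
    ffn_dim=128,
    num_layers=2,
    depthwise_conv_kernel_size=31,
)
audio = torch.randn(2, 16000)                    # (batch, samples)
mel = transform(audio)                           # (2, 80, 101): (batch, n_mels, time)
log_mel = torch.log(mel + 1e-9).transpose(1, 2)  # (2, 101, 80): (batch, time, input_dim)
lengths = torch.full((2,), log_mel.size(1))      # every sequence is full length
output, output_lengths = conformer(log_mel, lengths)
print(output.shape)  # torch.Size([2, 101, 80])

If I skip the transpose, I get a shape mismatch error from the Conformer, so the transpose does seem necessary; I just want to confirm this is the intended usage.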