Does torchaudio.transforms.Spectrogram work correctly if n_fft > win_size?

I’m using a custom loop to compute the STFT of an audio signal while doing some processing on a frame-by-frame basis. The number of resulting frames follows the formula:

  number_of_samples = 1440000 + winsize   (I pad either end of the signal with winsize // 2)
  number_of_frames  = number_of_samples // (winsize - winsize // 2)
                    = (1440000 + 1024) // (1024 - 512)
                    = 2814

Inside the loop, to prevent aliasing, I zero-pad the end of both the window and the input frame by winsize samples, but my hop size remains 512. This extra zero padding shouldn’t affect the number of frames, since it is applied frame by frame to the time-domain segment currently under analysis.

I then want to use torchaudio.transforms.MelSpectrogram on the original audio signal, which would give me two STFT representations: one with linearly spaced frequency bins and one with mel-spaced bins. However, the number of frames output by the transform is not what I expect, and it depends on the value of n_fft. With n_fft = winsize and center=True it outputs 2816 frames, and with center=False it outputs the expected 2814. However, if n_fft = 2048 and winsize = 1024 it outputs 2812 frames. I can’t work out why n_fft would affect the total number of frames if the frame count is based on signal length, window size, and hop size.

Hi @Mole_m7b5, thanks for posting the question. The number of frames is hard-coded in torch.stft:

  int64_t n_frames = 1 + (len - n_fft) / hop_length;
  // time2col
  input = input.as_strided(
    {batch, n_frames, n_fft},
    {input.stride(0), hop_length * input.stride(1), input.stride(1)}
  );
  if (window_.defined()) {
    input = input.mul(window_);
  }

If you increase n_fft while the input length and hop_length stay the same, the number of frames decreases.
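
The effect of that line on the frame count can be reproduced with plain arithmetic. The following is a minimal sketch, not the library's code: `stft_num_frames` is a hypothetical helper, and it assumes that center=True pads the signal by n_fft // 2 samples on each side before framing, as documented for torch.stft. The signal length 1441280 is also an assumption, chosen only because it reproduces the three frame counts quoted in the question.

```python
def stft_num_frames(length, n_fft, hop_length, center=True):
    """Frame count following `1 + (len - n_fft) / hop_length` (hypothetical helper)."""
    if center:
        # Assumed behaviour of center=True: n_fft // 2 samples of padding on each side.
        length += 2 * (n_fft // 2)
    if length < n_fft:
        raise ValueError("signal is shorter than one frame")
    return 1 + (length - n_fft) // hop_length


# Hypothetical signal length, picked to match the counts reported in the question.
length, hop = 1441280, 512

frames_center = stft_num_frames(length, n_fft=1024, hop_length=hop, center=True)
frames_no_center = stft_num_frames(length, n_fft=1024, hop_length=hop, center=False)
frames_big_fft = stft_num_frames(length, n_fft=2048, hop_length=hop, center=False)

print(frames_center, frames_no_center, frames_big_fft)  # 2816 2814 2812
```

With center=False, every extra hop_length added to n_fft removes one frame. Under the padding assumption above, center=True cancels that effect for even n_fft, and the count stays at 1 + length // hop_length.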

Hi @nateanl, thanks for the response.

How does that interact with n_fft > win_length? Does it mean the FFT window actually extends outside the segment covered by your (for example) Hann window? Or is win_length overridden to match n_fft?

If win_length is shorter than n_fft, the window is zero-padded on both sides to length n_fft. When computing the STFT, only the samples multiplied by non-zero window values contribute.

In other words, if win_length and hop_length stay the same and n_fft is increased, the frequency axis will have a finer bin spacing (it is up-sampled).
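
The window padding described here can be illustrated without any audio library. This is a rough sketch under assumptions: `hann` and `pad_window` are hypothetical helpers written for this example, the split of the padding (floor of half on the left, the rest on the right) is assumed, and the bin counts assume a one-sided spectrum of n_fft // 2 + 1 bins.

```python
import math


def hann(win_length):
    """Periodic Hann window as a plain list (stand-in for a library window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / win_length) for n in range(win_length)]


def pad_window(window, n_fft):
    """Zero-pad `window` on both sides to length n_fft (hypothetical helper)."""
    left = (n_fft - len(window)) // 2
    right = n_fft - len(window) - left
    return [0.0] * left + list(window) + [0.0] * right


win_length, n_fft = 1024, 2048
padded = pad_window(hann(win_length), n_fft)

left = (n_fft - win_length) // 2
outside = padded[:left] + padded[left + win_length:]

# Each frame spans n_fft samples, but only the middle win_length samples get a
# non-zero weight; everything outside that span is multiplied by zero.
print(len(padded), left, all(w == 0.0 for w in outside))  # 2048 512 True

# One-sided bin counts for n_fft = win_length versus n_fft = 2 * win_length.
print(win_length // 2 + 1, n_fft // 2 + 1)  # 513 1025
```

So the frame is n_fft samples long as far as the frame count is concerned, even though the analysed segment is still win_length samples.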

That makes sense.

But if n_fft is reached by zero-padding each frame after the windowed section has been grabbed, and only the original windowed section of the signal is transformed (at a finer FFT resolution due to the zero-padding), wouldn’t it make more sense for the number of output frames to be determined by win_length and hop_length? Or is there a particular reason it’s hard-coded according to n_fft?

The design follows librosa’s stft, which also uses n_fft to determine the number of frames.

There’s also a discussion about win_length and n_fft that you might be interested in: “Semantics of n_fft, window length, and frame length” (Issue #695, librosa/librosa on GitHub).
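
For readers who do want a frame count tied to win_length, one possible workaround is to pad the signal before a center=False transform so that the non-zero part of the padded window lands on the same samples a win_length-based loop would use. This is only a sketch under assumptions and was not proposed in the thread: `pad_amounts` and `analysed_spans` are hypothetical helpers, the window is assumed to sit (n_fft - win_length) // 2 samples into each n_fft-long frame, the signal length is made up, and the index arithmetic below does not call torch or torchaudio.

```python
def pad_amounts(win_length, n_fft):
    """Zeros to add before and after the signal (hypothetical helper)."""
    left = (n_fft - win_length) // 2
    right = n_fft - win_length - left
    return left, right


def analysed_spans(length, win_length, n_fft, hop_length):
    """(start, end) of the samples under the non-zero window, in original signal indices."""
    left, right = pad_amounts(win_length, n_fft)
    padded_length = left + length + right
    # Frame count formula quoted from torch.stft, applied to the padded signal.
    n_frames = 1 + (padded_length - n_fft) // hop_length
    spans = []
    for k in range(n_frames):
        frame_start = k * hop_length       # index in the padded signal
        window_start = frame_start + left  # where the non-zero window begins
        # Subtract `left` again to map back to the unpadded signal.
        spans.append((window_start - left, window_start - left + win_length))
    return spans


length, win_length, hop = 1441280, 1024, 512  # hypothetical signal length
spans = analysed_spans(length, win_length, n_fft=2048, hop_length=hop)
expected = [(k * hop, k * hop + win_length) for k in range(1 + (length - win_length) // hop)]

print(len(spans), spans == expected)  # 2814 True
```

The magnitudes should then match a loop that zero-pads each frame at the end, because moving the windowed segment inside a zero-padded frame is a circular shift, which changes only the phase.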