I’m using a custom loop to compute the STFT of an audio signal while doing some processing on a frame-by-frame basis. I pad either end of the signal with winsize // 2, so the number of resulting frames follows from:

number_of_samples = 1440000 + winsize
number_of_frames = number_of_samples // (winsize - winsize // 2)
                 = (1440000 + 1024) // (1024 - 512)
                 = 2814
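For reference, the arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the formula as stated in this post (the variable names are mine), not of what torchaudio does internally:

```python
# Frame count for the custom loop, following the formula in this post.
num_samples = 1_440_000
winsize = 1024
hopsize = winsize - winsize // 2                   # 512

# The signal is padded with winsize // 2 samples at each end.
padded_length = num_samples + 2 * (winsize // 2)   # 1441024

num_frames = padded_length // hopsize
print(num_frames)  # 2814
```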
Inside the loop, to prevent aliasing, I zero pad the end of both the window and the input frame by winsize samples, but my hopsize remains 512. This extra zero padding shouldn’t affect the number of frames, since it is applied frame by frame to the time-domain segment currently under analysis.
I then want to use torchaudio.transforms.MelSpectrogram on the original audio signal. This would give me two STFT signals, one with linearly spaced frequency bins and one with mel-spaced bins. However, the number of frames the transform outputs is not what I expect, and it depends on the value of n_fft. With n_fft = winsize and center=True it outputs 2816 frames, and with center=False it outputs the expected 2814. However, with n_fft = 2048 and winsize = 1024 it outputs 2812 frames. I can’t work out why n_fft would affect the total number of frames if the frame count is based on signal length, window size, and hopsize.
How does that interact when having a value n_fft > win_length? Does it mean the FFT window actually extends outside of the segment covered by your (for example) Hann window? Or is the win_length overridden to match n_fft?
If win_length is shorter than n_fft, the window is zero padded on both sides to match n_fft. When computing the STFT, only the samples that are multiplied by non-zero window values contribute to the result.
In other words, if win_length and hop_length stay the same and n_fft is increased, the frequency axis is sampled more finely (up-sampled).
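To make the padding concrete, here is a minimal pure-Python sketch. It assumes a periodic Hann window and an even split of the padding between the two sides; `padded_hann` is a hypothetical helper written for illustration, not a torch function:

```python
import math

def padded_hann(win_length, n_fft):
    """Periodic Hann window of win_length samples, zero padded on both
    sides to n_fft samples (even left/right split is an assumption)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_length)
              for n in range(win_length)]
    left = (n_fft - win_length) // 2
    right = n_fft - win_length - left
    return [0.0] * left + window + [0.0] * right

win_length, n_fft = 1024, 2048
w = padded_hann(win_length, n_fft)

print(len(w))                 # 2048 samples go into each FFT
print(win_length // 2 + 1)    # 513 one-sided bins if n_fft == win_length
print(n_fft // 2 + 1)         # 1025 one-sided bins with n_fft == 2048
```

The window still only "sees" 1024 samples of signal; the larger n_fft just evaluates the same spectrum at more frequency points.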
But if the n_fft value is matched by padding on a frame by frame basis after the windowed section has been grabbed, and only the original windowed section of the signal is transformed (but at a higher fft resolution due to the zero-padding), would it make more sense for the number of output frames to be hardcoded according to win_length and hop_length? Or is there a particular reason it’s hardcoded according to the n_fft?
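One possible reading, sketched below: with center=False the torch.stft documentation gives the frame count as 1 + (L - n_fft) // hop_length, which is what you get if each frame is an n_fft-long slice of the signal (with the padded window applied to it) rather than a win_length-long slice. The signal length here is a made-up value for illustration and `frame_starts` is a hypothetical helper:

```python
def frame_starts(signal_length, frame_length, hop_length):
    """Start index of every frame that fits entirely inside the signal."""
    return list(range(0, signal_length - frame_length + 1, hop_length))

signal_length = 1_441_024   # illustrative length only
hop_length = 512

frames_1024 = len(frame_starts(signal_length, 1024, hop_length))
frames_2048 = len(frame_starts(signal_length, 2048, hop_length))

# Doubling the slice length from 1024 to 2048 costs 1024 / 512 = 2 frames.
print(frames_1024 - frames_2048)  # 2
```

A drop of two frames is the same size as the 2814 to 2812 difference reported above, which suggests the slicing is driven by n_fft.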
Hi
I can’t work out n_frames in my case.
The length of my sample is 90000, with n_fft = 1024 and hop_length = 128. According to the formula, the resulting n_frames should be roughly 696. But torch returns a matrix with n_frames = 704!
Hi @hossein, num_frames can be 704 if center=True is set in torch.stft, and it is True by default. What arguments are you passing to torch.stft?
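As a quick check, the frame-count formulas given in the torch.stft documentation reproduce both numbers. `stft_num_frames` is a hypothetical helper that only evaluates those formulas; it does not call torch:

```python
def stft_num_frames(length, n_fft, hop_length, center=True):
    """Expected number of STFT frames per the torch.stft documentation."""
    if center:
        # The signal is padded by n_fft // 2 on both sides first.
        return 1 + length // hop_length
    return 1 + (length - n_fft) // hop_length

print(stft_num_frames(90000, 1024, 128, center=True))   # 704
print(stft_num_frames(90000, 1024, 128, center=False))  # 696
```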