I’m using a custom loop to get the STFT of an audio signal while doing some processing on a frame-by-frame basis. I pad either end of the signal, so that `number_of_samples = 1440000 + winsize`. The number of resulting frames corresponds to the formula

    number_of_frames = number_of_samples // (window_size - window_size // 2)
                     = (1440000 + 1024) // (1024 - 512)
                     = 2814
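The arithmetic above can be checked with a few lines (a sketch; the variable names are mine, not from any library):

```python
winsize = 1024
hopsize = winsize - winsize // 2          # 1024 - 512 = 512
number_of_samples = 1_440_000 + winsize   # signal length after padding both ends
number_of_frames = number_of_samples // hopsize
print(number_of_frames)                   # 2814
```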
During the loop, to prevent aliasing, I zero-pad the end of both the window and the input frame by length `winsize`, but my hop size still remains at 512. This extra zero padding shouldn’t affect the number of frames, as it’s done on a frame-by-frame basis to the current time-domain segment under analysis.
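The per-frame zero padding might be sketched like this (hypothetical names; I’m assuming a Hann window, and the transform length doubles to `2 * winsize` since both the window and the frame are padded by `winsize`):

```python
import numpy as np

winsize = 1024
window = np.hanning(winsize)

# One time-domain segment under analysis (zeros as a stand-in for real audio).
frame = np.zeros(winsize)

# Zero-pad the windowed frame by winsize before the FFT; the hop size is
# untouched, so the total frame count does not change.
padded = np.concatenate([frame * window, np.zeros(winsize)])
spectrum = np.fft.rfft(padded)            # 2 * winsize input points
```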
I then want to use `torchaudio.transforms.MelSpectrogram` on the original audio signal; this would give me two STFT representations, one with linearly spaced frequency bins and one with mel-spaced bins. However, the number of frames output by the transform is not as expected, and it depends on the value of `n_fft`. With `n_fft = winsize` and `center=True` it outputs 2816 frames, and with `center=False` it outputs the expected 2814. However, if `n_fft = 2048` and `winsize = 1024`, it outputs 2812 frames. I can’t work out why `n_fft` would affect the total number of frames if the frame count is based on signal length, window size, and hop size.
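For reference, the frame count of `torch.stft` (which `MelSpectrogram` wraps) follows from its documentation: with `center=True` the input is first padded by `n_fft // 2` on both sides, and each frame spans `n_fft` samples, not `win_length`. A small sketch of that formula (`stft_frames` is my own helper, not a torchaudio API):

```python
def stft_frames(n_samples: int, n_fft: int, hop: int, center: bool) -> int:
    """Frame count per the torch.stft docs (helper name is mine)."""
    if center:
        n_samples += 2 * (n_fft // 2)   # implicit padding on both sides
    return 1 + (n_samples - n_fft) // hop

# With center=True the padding cancels n_fft, so the count is
# 1 + n_samples // hop regardless of n_fft; with center=False the last
# frame must fit n_fft samples, so a larger n_fft means fewer frames.
```

If this reading of the docs is right, it would explain why `n_fft` changes the count with `center=False`: the last frame has to hold `n_fft` samples, not `winsize`.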