torch.stft and torch.istft with window function


I tried something like this:

import torch

x = torch.randn(100000)
x_stft = torch.stft(x, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=True)
x_reconstruct = torch.istft(x_stft, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=False)

Since the Hann window is applied twice (once in the forward transform and once in the inverse), I would expect x_reconstruct to be 1.5*x (1.5 is the overlap-add sum of torch.hann_window(1024)**2 when hop_length = win_length/4). Instead, I find that x_reconstruct is equal to x.

Do torch.stft and torch.istft perform some internal normalization when a window function is applied, rather than simply multiplying the signal by the window? If so, any details on the whole procedure would be appreciated (e.g. does it happen in torch.stft, in torch.istft, or in both? What exactly is the normalization that the two functions perform?).

Thank you!

Torch stft matches that of librosa, so you can use it as a reference.
I don’t understand where the 1.5 comes from. In fact if you check the docs.
Wrt the inverse:
They just run an iterative minimization process for which they need to know the original window.

Also, it wouldn’t be an inverse if it doesn’t match the original signal.


What I mean is that the window weights accumulate during the fft and ifft, so the overlap-added signal ends up scaled by a factor (and if the hop length is chosen correctly, this factor is a constant). A quick demo is as follows:

import numpy as np

# simulate five overlapping frames with a 512-point window and a 128-point hop
window_size = 512
hop_size = window_size // 4
n_hops = 4
n_fft = 512
dummy_signal = np.random.randn(window_size+hop_size*n_hops)
dummy_signal_taps = [dummy_signal[idx:idx+window_size] for idx in range(0, hop_size*n_hops + 1, hop_size)]
win = np.hanning(window_size)

# fft with window function (analysis window)
dummy_signal_taps_fft = [np.fft.rfft(s*win, n=n_fft) for s in dummy_signal_taps]

# ifft with window function (synthesis window)
dummy_signal_taps_reconstruct = [np.fft.irfft(s_fft, n=n_fft) * win for s_fft in dummy_signal_taps_fft]

# overlap-add the windowed frames to reconstruct the original signal
dummy_signal_reconstruct = np.zeros(window_size+hop_size*n_hops)
for i, s in enumerate(dummy_signal_taps_reconstruct):
    dummy_signal_reconstruct[i*hop_size:i*hop_size+window_size] += s

#now check the reliable part of the reconstructed signal
dummy_signal_reliable = dummy_signal[(n_hops-1)*hop_size:-(n_hops-1)*hop_size]
dummy_signal_reconstruct_reliable = dummy_signal_reconstruct[(n_hops-1)*hop_size:-(n_hops-1)*hop_size]

print(dummy_signal_reconstruct_reliable / dummy_signal_reliable)

The printed results will show that, after fft and ifft with hanning window and a hop length of a quarter of the window size, the reconstructed signal is scaled by a factor of approximately 1.5.
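The 1.5 can also be checked from the window alone, without any signal. The following is a minimal sketch, assuming the periodic form of the Hann window (the default form of torch.hann_window) and the 1024/256 sizes from the first post; the helper names are made up for illustration. It sums the squared window over the four frames that overlap at each sample.

```python
import math

def periodic_hann(n_points):
    # Periodic Hann window: w[n] = 0.5 - 0.5*cos(2*pi*n/N).
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]

def overlap_add_of_squares(window, hop):
    # In the steady-state region, a sample at offset n inside one hop is
    # covered by size // hop frames, at window offsets n, n+hop, n+2*hop, ...
    # Sum the squared window over those offsets.
    size = len(window)
    return [sum(window[n + k * hop] ** 2 for k in range(size // hop))
            for n in range(hop)]

win = periodic_hann(1024)
envelope = overlap_add_of_squares(win, hop=256)

# Smallest and largest value of the summed squared window over one hop.
print(min(envelope), max(envelope))
```

With the periodic window the sum is constant across the hop. With the symmetric np.hanning used in the demo above it is only approximately constant, which is why the printed ratio there is approximately 1.5 rather than exactly 1.5.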

Thank you for the information! I will give it a look!

Hmmm in fact you are right:

    # Normalize by sum of squared window
    ifft_window_sum = window_sumsquare(

    approx_nonzero_indices = ifft_window_sum > util.tiny(ifft_window_sum)
    y[..., approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices]

There is a normalization step which I think aims to compensate for the windowing (it’s only applied when inverting the stft, not in a plain fft/ifft). Anyway, I should look into it in depth.
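To illustrate what that normalization does, here is a rough sketch in plain Python. It is not the librosa or torch code. It assumes a periodic Hann window used for both analysis and synthesis, skips the fft/ifft pair (which returns each windowed frame unchanged), and uses a made-up function name and a small threshold chosen only for this example. It overlap-adds the doubly windowed frames and then divides by the sum of squared windows.

```python
import math
import random

def windowed_roundtrip(signal, window, hop):
    # Window every frame twice (analysis + synthesis) and overlap-add.
    size = len(window)
    total = len(signal)
    summed = [0.0] * total      # overlap-added, not yet normalized
    envelope = [0.0] * total    # sum of squared windows at each sample
    for start in range(0, total - size + 1, hop):
        for n in range(size):
            summed[start + n] += signal[start + n] * window[n] ** 2
            envelope[start + n] += window[n] ** 2
    # Divide by the envelope wherever it is not (almost) zero.
    normalized = [s / e if e > 1e-11 else 0.0
                  for s, e in zip(summed, envelope)]
    return summed, normalized

random.seed(0)
size, hop = 64, 16
window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]
signal = [random.gauss(0.0, 1.0) for _ in range(size + 8 * hop)]
summed, normalized = windowed_roundtrip(signal, window, hop)

# Compare only on the interior, where four frames overlap at every sample.
interior = range(size, len(signal) - size)
max_err_raw = max(abs(summed[i] - 1.5 * signal[i]) for i in interior)
max_err_norm = max(abs(normalized[i] - signal[i]) for i in interior)
print(max_err_raw, max_err_norm)
```

Before the division the interior is 1.5 times the signal, matching the demo earlier in the thread; after the division it matches the signal itself, which is consistent with x_reconstruct being equal to x in the original question.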

Thanks! I will have a look at it.