# Torch.stft and torch.istft with window function

Hi,

I tried something like this:

```python
x = torch.randn(100000)
x_stft = torch.stft(x, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=True)
x_reconstruct = torch.istft(x_stft, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=False)
```

Since the Hann window is applied twice, I would expect `x_reconstruct` to be `1.5*x` (`1.5` is the constant value of the overlap-added `torch.hann_window(1024)**2` when `hop_length = win_length/4`). But instead I find `x_reconstruct` is equal to `x`.
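For reference, here is where the 1.5 comes from: with a periodic Hann window of length `N` (which is what `torch.hann_window` returns by default) and a hop of `N/4`, the squared windows overlap-add to a constant 1.5 away from the edges. A minimal numpy check:

```python
import numpy as np

N = 1024
hop = N // 4

# periodic Hann window, matching torch.hann_window(N) (periodic=True by default)
win = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))

# overlap-add the squared window across several frames
n_frames = 9
envelope = np.zeros(N + hop * (n_frames - 1))
for i in range(n_frames):
    envelope[i * hop:i * hop + N] += win ** 2

# away from the first/last few frames, every sample is covered by exactly
# four overlapping squared windows, and their sum is the constant 1.5
print(envelope[N:-N])
```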

Do `torch.stft` and `torch.istft` perform some internal normalization when a window function is applied, rather than simply multiplying the window and the signal? If so, any details on the procedure would be appreciated (e.g. does this happen in `torch.stft`, in `torch.istft`, or in both? What exactly is the normalization the two functions perform?).

Thank you!

Torch's stft matches that of librosa, so you can use it as a reference.
I don’t understand where the 1.5 comes from. In fact, check the docs for the inverse:
https://librosa.org/doc/main/generated/librosa.istft.html
They solve a least-squares minimization for which they need to know the original window.

Also, it wouldn’t be an inverse if it didn’t match the original signal.


Here I mean that the weight of the window function accumulates during the FFT and IFFT, and eventually scales the signal by some factor (if the hop length is chosen correctly, this factor is a constant). A quick demo is as follows:

```python
import numpy as np

# simulate overlapping frames with a 512-point window and 128-point hop
window_size = 512
hop_size = window_size // 4
n_hops = 4
n_fft = 512
dummy_signal = np.random.randn(window_size + hop_size * n_hops)
dummy_signal_taps = [dummy_signal[idx:idx + window_size]
                     for idx in range(0, hop_size * n_hops + 1, hop_size)]
win = np.hanning(window_size)

# FFT with the window applied
dummy_signal_taps_fft = [np.fft.rfft(s * win, n=n_fft) for s in dummy_signal_taps]

# IFFT, applying the window a second time
dummy_signal_taps_reconstruct = [np.fft.irfft(s_fft, n=n_fft) * win
                                 for s_fft in dummy_signal_taps_fft]

# overlap-add to reconstruct the original signal
dummy_signal_reconstruct = np.zeros(window_size + hop_size * n_hops)
for i, s in enumerate(dummy_signal_taps_reconstruct):
    dummy_signal_reconstruct[i * hop_size:i * hop_size + window_size] += s

# now check the fully-overlapped (reliable) part of the reconstructed signal
dummy_signal_reliable = dummy_signal[(n_hops - 1) * hop_size:-(n_hops - 1) * hop_size]
dummy_signal_reconstruct_reliable = dummy_signal_reconstruct[(n_hops - 1) * hop_size:-(n_hops - 1) * hop_size]

print(dummy_signal_reconstruct_reliable / dummy_signal_reliable)
```

The printed ratios show that, after an FFT and IFFT with a Hann window and a hop length of a quarter of the window size, the reconstructed signal is scaled by a factor of approximately 1.5.
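The 1.5 factor in the demo above can be undone the way librosa's `istft` does it: divide the overlap-added result by the overlap-added sum of squared windows instead of by a hard-coded constant, which also handles the partially-covered edges. A self-contained sketch of that idea (variable names like `envelope` are just illustrative):

```python
import numpy as np

window_size = 512
hop_size = window_size // 4
n_hops = 4
sig = np.random.randn(window_size + hop_size * n_hops)
win = np.hanning(window_size)

taps = [sig[i:i + window_size] for i in range(0, hop_size * n_hops + 1, hop_size)]

# analysis window -> FFT -> IFFT -> synthesis window, then overlap-add;
# accumulate the squared-window envelope alongside the signal
recon = np.zeros_like(sig)
envelope = np.zeros_like(sig)
for k, tap in enumerate(taps):
    frame = np.fft.irfft(np.fft.rfft(tap * win, n=window_size), n=window_size) * win
    recon[k * hop_size:k * hop_size + window_size] += frame
    envelope[k * hop_size:k * hop_size + window_size] += win ** 2

# divide by the sum of squared windows wherever it is non-negligible
nonzero = envelope > 1e-8
recon[nonzero] /= envelope[nonzero]

# the reconstruction now matches the input, edges included
print(np.max(np.abs(recon[nonzero] - sig[nonzero])))
```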

Thank you for the information! I will give it a look!

Hmmm in fact you are right:

```python
    # Normalize by sum of squared window
    ifft_window_sum = window_sumsquare(
        window=window,
        n_frames=n_frames,
        win_length=win_length,
        n_fft=n_fft,
        hop_length=hop_length,
        dtype=dtype,
    )

    approx_nonzero_indices = ifft_window_sum > util.tiny(ifft_window_sum)
    y[..., approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices]
```

There is a normalization step which, I think, compensates for the windowing (the window itself is applied when computing the stft, while this compensation only appears in the inverse). In any case it's worth looking into in depth.
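This can be confirmed on the torch side as well; a quick round-trip sketch (`length=` just trims the centered padding so the output shape matches the input):

```python
import torch

x = torch.randn(100000)
win = torch.hann_window(1024)

X = torch.stft(x, n_fft=1024, hop_length=256, win_length=1024,
               window=win, return_complex=True)
y = torch.istft(X, n_fft=1024, hop_length=256, win_length=1024,
                window=win, length=x.numel())

# the sum-of-squared-windows normalization inside istft removes the
# 1.5 factor, so the round trip reproduces x up to float32 error
print(torch.max(torch.abs(x - y)))
```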

Thanks! I will have a look at it.