Hi,

I tried something like this:

```
x = torch.randn(100000)
x_stft = torch.stft(x, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=True)
x_reconstruct = torch.istft(x_stft, n_fft=1024, hop_length=256, win_length=1024, window=torch.hann_window(1024), return_complex=False)
```

Since the Hann window is applied twice, I would expect `x_reconstruct` to be `1.5*x` (`1.5` being the overlap-added sum of `torch.hann_window(1024)**2` at `hop_length = win_length/4`). But instead I find that `x_reconstruct` is equal to `x`.

Do `torch.stft` and `torch.istft` perform some internal processing to normalize the result when a window function is applied, rather than simply multiplying the window function with the signal? If yes, any details on the whole procedure would be appreciated (e.g., does this happen in `torch.stft`, in `torch.istft`, or both? What is the detailed normalization that the two functions perform?).

Thank you!

Torch's stft matches that of librosa, so you can use it as a reference.

I don't understand where the 1.5 comes from. In fact, if you check the docs for the inverse:

https://librosa.org/doc/main/generated/librosa.istft.html

They run a least-squares minimization process for which they need to know the original window.

Also, it wouldn't be an inverse if it didn't match the original signal.
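If I read the reference behind those docs right (Griffin & Lim, 1984), the least-squares solution is actually closed-form: overlap-add the windowed inverse FFTs, then divide by the accumulated squared window:

```
y[n] = sum_k( w[n - k*H] * y_k[n - k*H] ) / sum_k( w[n - k*H]**2 )
```

where `y_k` is the inverse FFT of frame `k`, `w` is the window, and `H` is the hop length.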


Here I mean that the weight of the window function accumulates during the FFT and IFFT, and eventually it scales the signal by some factor (and if the hop length is chosen correctly, this factor is a constant). A quick demo is as follows:

```
import numpy as np

# Simulate overlapping frames with a 512-point window and a hop of a quarter window
window_size = 512
hop_size = window_size // 4
n_hops = 4
n_fft = 512
dummy_signal = np.random.randn(window_size + hop_size * n_hops)
dummy_signal_taps = [dummy_signal[idx:idx + window_size]
                     for idx in range(0, hop_size * n_hops + 1, hop_size)]
win = np.hanning(window_size)
# FFT with the window applied to each frame (analysis)
dummy_signal_taps_fft = [np.fft.rfft(s * win, n=n_fft) for s in dummy_signal_taps]
# IFFT with the window applied again (synthesis)
dummy_signal_taps_reconstruct = [np.fft.irfft(s_fft, n=n_fft) * win
                                 for s_fft in dummy_signal_taps_fft]
# Overlap-add the frames to reconstruct the signal
dummy_signal_reconstruct = np.zeros(window_size + hop_size * n_hops)
for i, s in enumerate(dummy_signal_taps_reconstruct):
    dummy_signal_reconstruct[i * hop_size:i * hop_size + window_size] += s
# Check only the fully overlapped (reliable) part of the reconstruction
dummy_signal_reliable = dummy_signal[(n_hops - 1) * hop_size:-(n_hops - 1) * hop_size]
dummy_signal_reconstruct_reliable = dummy_signal_reconstruct[(n_hops - 1) * hop_size:-(n_hops - 1) * hop_size]
print(dummy_signal_reconstruct_reliable / dummy_signal_reliable)
```

The printed values show that, after an FFT and IFFT with a Hann window and a hop length of a quarter of the window size, the reconstructed signal is scaled by a factor of approximately 1.5.
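The factor itself can be computed directly: in the fully overlapped region, the overlap-added squared window is (approximately) constant, equal to the sum of the squared window divided by the hop:

```
import numpy as np

win = np.hanning(512)
# Average of the overlap-added squared window: sum(w**2) / hop ~ 3/8 * 4 = 1.5
print(np.sum(win ** 2) / 128)  # ~1.497 (exactly 1.5 for a periodic Hann window)
```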

Thank you for the information! I will give it a look!

Hmmm in fact you are right:

```
# Normalize by sum of squared window
ifft_window_sum = window_sumsquare(
    window=window,
    n_frames=n_frames,
    win_length=win_length,
    n_fft=n_fft,
    hop_length=hop_length,
    dtype=dtype,
)
approx_nonzero_indices = ifft_window_sum > util.tiny(ifft_window_sum)
y[..., approx_nonzero_indices] /= ifft_window_sum[approx_nonzero_indices]
```

There is a normalization step which I think aims to compensate for the windowing (as it's only used when computing the STFT but not a plain FFT). Anyway, I should look into it in more depth.
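To connect this with the earlier numpy demo (a minimal sketch, assuming the same window/hop setup): dividing the overlap-add result by the accumulated squared window removes the 1.5 factor and recovers the input wherever the denominator is nonzero:

```
import numpy as np

window_size = n_fft = 512
hop_size = window_size // 4
n_hops = 4
win = np.hanning(window_size)

x = np.random.randn(window_size + hop_size * n_hops)
taps = [x[i:i + window_size] for i in range(0, hop_size * n_hops + 1, hop_size)]

# Overlap-add the windowed round trips and accumulate the squared window
y = np.zeros_like(x)
win_sumsquare = np.zeros_like(x)
for k, tap in enumerate(taps):
    frame = np.fft.irfft(np.fft.rfft(tap * win, n=n_fft), n=n_fft) * win
    y[k * hop_size:k * hop_size + window_size] += frame
    win_sumsquare[k * hop_size:k * hop_size + window_size] += win ** 2

# Same normalization as in the librosa snippet above
nonzero = win_sumsquare > 1e-10
y[nonzero] /= win_sumsquare[nonzero]
print(np.allclose(y[nonzero], x[nonzero]))  # True
```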

Thanks! I will have a look at it.