STFT output shape


I am confused about the output shape from STFT. Given

print(y.shape)
s = torch.stft(y, frame_length=128, hop=32)
print(s.shape)

we have

torch.Size([3, 7936])
torch.Size([3, 245, 65, 2])

According to the doc, “Returns the real and the imaginary parts together as one tensor of size (∗×N×2), where ∗ is the shape of input signal, N is the number of ω s considered depending on fft_size and return_onesided, and each pair in the last dimension represents a complex number as real part and imaginary part.” Since * is the shape of the input signal, I would expect the returned tensor to have shape [3, 7936, N, 2]. So how is “245” computed from the input length “7936”?


It’s similar to the sliding windows of a convolution.
Have a look at the output shape formula for Conv2d.
Given your input size of 7936, your “kernel”, or in this case your frame, has a length of 128, with a stride or hop of 32:

((7936 - (128 - 1) - 1) / 32) + 1 = 245
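
If it helps, here is the same arithmetic as a small pure-Python sketch. The helper names num_frames and num_onesided_bins are made up for illustration and are not part of torch. It assumes the signal is not padded (which matches the shapes you posted) and that fft_size equals frame_length; under those assumptions the 65 in your output is the one-sided bin count, 128 / 2 + 1.

```python
def num_frames(signal_length, frame_length, hop):
    """Count the full frames that fit in the signal without padding.

    Hypothetical helper: same sliding-window arithmetic as a 1-D
    convolution with kernel size frame_length and stride hop.
    """
    if signal_length < frame_length:
        return 0  # not even one full frame fits
    return (signal_length - frame_length) // hop + 1


def num_onesided_bins(fft_size):
    """Frequency bins kept when only the non-negative half is returned."""
    return fft_size // 2 + 1


frames = num_frames(7936, 128, 32)
bins_ = num_onesided_bins(128)  # assumes fft_size == frame_length
print(frames, bins_)  # 245 65
```

So for a batch of 3 signals the expected shape is [3, 245, 65, 2]: batch, frames, frequency bins, and the real/imaginary pair.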

Cool, thanks a million.