Overlapping Windows over Batched Audio

I have a batched audio signal of shape [N, C, L], e.g. N=4, C=4, L=2048. I want to apply a sliding window with overlap to it, and later reconstruct the original shape using overlap-add.

Example:

Input shape → [4, 4, 2048]
After sliding window (window size 256, overlap 128) → [4, 4, 15, 256]
After overlap-add → [4, 4, 2048]
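For reference, the 15 frames in the example follow from the window size W = 256 and the hop H = W - overlap = 128: Tensor.unfold produces floor((L - W) / H) + 1 frames, i.e. (2048 - 256) / 128 + 1 = 15. A tiny helper to check other settings (`num_frames` is a hypothetical name, not a PyTorch function):

```python
def num_frames(length: int, window: int, hop: int) -> int:
    # Number of complete windows when sliding by `hop` samples; this matches
    # the frame count of Tensor.unfold, which drops incomplete trailing windows.
    return (length - window) // hop + 1

# An overlap of 128 with a window of 256 means a hop (step) of 256 - 128 = 128.
print(num_frames(2048, 256, 256 - 128))
# > 15
```

When (L - W) is not a multiple of H, the trailing samples that do not fill a whole window are dropped by unfold, so overlap-add cannot restore them unless the signal is padded first.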

Tensor.unfold() works well for applying the sliding window, but I haven’t found a way to do the overlap-add afterwards.
I also played around with nn.Fold and nn.Unfold, but if I understand correctly, those transformations mess with the channel dimension, which I don’t want.

Could you explain what “mess with the channel dimension” means when using nn.Fold/nn.Unfold?
Did you see any errors using these modules?

Thank you for your reply! When using nn.Fold, the channel dimension of the input is divided by the product of kernel_size. Since I only have 4 channels, I can only use very small kernels, so I don’t know how to achieve, for example, a window size of 256 with an overlap of 128.
I want the channels to stay the same and only “cut” along the signal length.
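What “divided by the product of kernel_size” means in practice: nn.Fold expects its input as (N, C * prod(kernel_size), n_blocks) and returns (N, C, *output_size). With a (256, 1) kernel sliding along the length, the 4 channels are packed together with the 256 window samples into a 1024-wide second dimension, so the kernel does not have to fit into C. A minimal shape check, assuming the signal is viewed as an (L x 1) image (random input, only the shapes matter):

```python
import torch
import torch.nn as nn

N, C, L = 2, 4, 2048
window, hop = 256, 128
n_frames = (L - window) // hop + 1  # 15

# Treat each channel as an (L x 1) "image" and slide a (window x 1) kernel along L.
fold = nn.Fold(output_size=(L, 1), kernel_size=(window, 1), stride=(hop, 1))

# Fold's input packs C channels x prod(kernel_size) window samples into dim 1.
blocks = torch.randn(N, C * window, n_frames)  # (2, 1024, 15)
out = fold(blocks)
print(out.shape)
# > torch.Size([2, 4, 2048, 1]), i.e. C = 1024 / 256 = 4 channels are preserved
```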

This snippet does exactly what I want, but I haven’t found a way to achieve the same behavior with nn.Fold/Unfold.

import torch

x = torch.arange(0., 16384).view(2, 4, 2048)
# shape: (2, 4, 2048)
x_framed = x.unfold(-1, 256, 128)  # unfold(dimension, size, step)
# shape: (2, 4, 15, 256)
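A minimal overlap-add sketch in plain PyTorch, assuming rectangular frames straight from Tensor.unfold (no window function applied): each frame is added back at offset i * hop, and the sum is divided by the number of frames covering each sample. The `overlap_add` helper is hypothetical, not a PyTorch API.

```python
import torch

def overlap_add(frames: torch.Tensor, hop: int) -> torch.Tensor:
    """Hypothetical helper: sum frames of shape (..., n_frames, window) into (..., length)."""
    *lead, n_frames, window = frames.shape
    length = (n_frames - 1) * hop + window
    out = frames.new_zeros((*lead, length))
    for i in range(n_frames):
        # Add frame i back at its original offset; overlapping regions are summed.
        out[..., i * hop : i * hop + window] += frames[..., i, :]
    return out

x = torch.arange(0., 16384).view(2, 4, 2048)
x_framed = x.unfold(-1, 256, 128)                  # (2, 4, 15, 256)
y = overlap_add(x_framed, hop=128)                 # (2, 4, 2048), overlaps are summed
count = overlap_add(torch.ones(15, 256), hop=128)  # frames covering each sample: 1 or 2
x_rec = y / count                                  # count broadcasts over (N, C)
print(torch.equal(x_rec, x))
# > True
```

The Python loop runs once per frame, which is cheap for 15 frames; for many frames a vectorized fold is preferable.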

nn.Unfold would return the same output if you reshape it accordingly:

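import torch
import torch.nn as nn

x = torch.arange(0., 16384).view(2, 4, 2048)
# shape: (2, 4, 2048)
x_framed = x.unfold(-1, 256, 128)
# shape: (2, 4, 15, 256)

# module approach: treat each channel as an (L x 1) image and slide a (256 x 1) kernel along L
unfold = nn.Unfold(kernel_size=(256, 1), stride=(128, 1))
out_ref = unfold(x.unsqueeze(3))
# shape: (2, 4 * 256, 15)
out_ref = out_ref.view(out_ref.size(0), 4, 256, -1).permute(0, 1, 3, 2).contiguous()
# shape: (2, 4, 15, 256)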

print((x_framed - out_ref).abs().max())
# > tensor(0.)
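The reverse direction can use nn.Fold, which sums overlapping blocks and therefore performs the overlap-add. A sketch, assuming the (N, C, n_frames, window) layout from x.unfold and the same (L x 1) view as the module approach; as with any plain overlap-add, the sum has to be divided by how many frames cover each sample to get x back:

```python
import torch
import torch.nn as nn

x = torch.arange(0., 16384).view(2, 4, 2048)
N, C, L = x.shape
window, hop = 256, 128

# Framing as in the question: (N, C, n_frames, window) = (2, 4, 15, 256)
x_framed = x.unfold(-1, window, hop)
n_frames = x_framed.size(2)

# Fold expects (N, C * window, n_frames): the inverse of the view/permute used above.
fold = nn.Fold(output_size=(L, 1), kernel_size=(window, 1), stride=(hop, 1))
blocks = x_framed.permute(0, 1, 3, 2).reshape(N, C * window, n_frames)
summed = fold(blocks).squeeze(3)  # (N, C, L); overlapping samples are summed

# Number of frames covering each sample (1 at the edges, 2 elsewhere here).
count = fold(torch.ones(1, window, n_frames)).squeeze(3)  # (1, 1, L)

x_rec = summed / count
print((x_rec - x).abs().max())
# > tensor(0.)
```

This assumes no window function and that (L - window) is a multiple of hop; otherwise count is zero for the uncovered trailing samples. If the frames are multiplied by an analysis window (e.g. a Hann window), divide by the overlap-added window values instead of the plain count.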