Intermediate convolution memory consumption

Hi, I’ve been trying to compare the memory consumption of a standard sliding-window (spatial-domain) approach with that of a frequency-domain approach (Fourier transform followed by an element-wise product).

Considering an image of dimensions HxW and a kernel of dimensions KhxKw, we would have:

Fourier: In the frequency domain the kernel must have the same size as the image. Assuming we store the real and imaginary parts in separate channels, we have HxWx2 for the filter and HxWx2 for the result, HxWx4 in total.
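To make the Fourier count concrete, here is a minimal sketch that just reproduces my accounting above in Python. The function name `fourier_floats` and the 224x224 example size are made up for illustration; the count is in stored values (floats), not bytes, and it only includes the two buffers I listed.

```python
def fourier_floats(H, W):
    """Number of stored values for the frequency-domain approach,
    following the accounting in this post (hypothetical helper)."""
    # Kernel padded to the image size; real and imaginary parts
    # kept in two separate channels.
    filter_floats = H * W * 2
    # Element-wise product, also stored as two channels.
    result_floats = H * W * 2
    return filter_floats + result_floats


# Example with an assumed 224x224 image.
print(fourier_floats(224, 224))  # 200704, i.e. 224 * 224 * 4
```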

Sliding window: Here we have HxW for the result and KhxKw for the kernel, which seems much less than HxWx4 because filters are usually small. However, I’m not sure how much memory the operation itself requires, i.e. for storing intermediate results of the sliding window. I’ve looked online but couldn’t find anything precise about how the convolutions are actually performed.
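For reference, this is the kind of naive sliding window I have in mind, as a rough pure-Python sketch. It assumes a single-channel image, zero padding so the output stays HxW, and no kernel flip (cross-correlation, as is common in deep learning). The names `sliding_window_floats` and `naive_conv2d_same` are my own. Written this way, the only intermediate storage is one scalar accumulator, but I don't know whether optimized libraries work like this or trade extra workspace memory for speed.

```python
def sliding_window_floats(H, W, Kh, Kw):
    """Stored values for the sliding-window approach as counted in this
    post: the HxW result plus the KhxKw kernel (hypothetical helper)."""
    return H * W + Kh * Kw


def naive_conv2d_same(image, kernel):
    """Naive 'same'-size sliding window over a 2D list of numbers.

    Out-of-bounds pixels are treated as zero, so no padded copy of the
    image is allocated; the only temporary is the accumulator `acc`.
    """
    H, W = len(image), len(image[0])
    Kh, Kw = len(kernel), len(kernel[0])
    ph, pw = Kh // 2, Kw // 2  # offsets that center the kernel
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for u in range(Kh):
                for v in range(Kw):
                    y, x = i + u - ph, j + v - pw
                    if 0 <= y < H and 0 <= x < W:
                        acc += image[y][x] * kernel[u][v]
            out[i][j] = acc
    return out


# A 5x5 image of ones with a 3x3 kernel of ones: interior pixels see all
# nine kernel taps, corner pixels only four.
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]
result = naive_conv2d_same(image, kernel)
print(result[2][2], result[0][0])  # 9.0 4.0
print(sliding_window_floats(224, 224, 3, 3))  # 50185
```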

Could someone please enlighten me on this? And what about backprop?