How are unfold and fold implemented?

I want to implement the unfold and fold operations in NumPy, using fold and unfold in torch as a reference.

When I followed the torch source code, I found: torch/nn/modules/>>torch/nn/>>torch/nn/_functions/thnn/
input, output,
kernel_size[0], kernel_size[1],
dilation[0], dilation[1],
padding[0], padding[1],
stride[0], stride[1])

This seems to be implemented in C, and I can't inspect it.

So I am curious: how are unfold and fold actually implemented?

In other people's implementations that I have read, im2col is performed by sliding a window over the input with multiple nested for loops.
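For reference, the loop-based approach can be sketched roughly as follows. This is a minimal NumPy sketch, not the torch implementation: the function name `unfold_loops` is made up for illustration, the input is assumed to have shape (N, C, H, W), and the output layout (N, C*kh*kw, L) is only intended to mirror the one documented for torch.nn.functional.unfold.

```python
import numpy as np

def unfold_loops(x, kernel_size, dilation=(1, 1), padding=(0, 0), stride=(1, 1)):
    """Loop-based im2col sketch: x is (N, C, H, W), result is (N, C*kh*kw, L)."""
    n, c, h, w = x.shape
    kh, kw = kernel_size
    dh, dw = dilation
    ph, pw = padding
    sh, sw = stride
    # Zero-pad the two spatial dimensions only.
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    # Number of sliding-window positions along each spatial axis.
    out_h = (h + 2 * ph - dh * (kh - 1) - 1) // sh + 1
    out_w = (w + 2 * pw - dw * (kw - 1) - 1) // sw + 1
    cols = np.zeros((n, c, kh, kw, out_h, out_w), dtype=x.dtype)
    for i in range(kh):                  # row offset inside the kernel
        for j in range(kw):              # column offset inside the kernel
            for oh in range(out_h):      # window position (rows)
                for ow in range(out_w):  # window position (columns)
                    cols[:, :, i, j, oh, ow] = xp[:, :, oh * sh + i * dh, ow * sw + j * dw]
    # Flatten (C, kh, kw) into one axis and (out_h, out_w) into L.
    return cols.reshape(n, c * kh * kw, out_h * out_w)
```

For a 4x4 input holding 0..15 and a 2x2 kernel, the first column of the result is the top-left window, [0, 1, 4, 5].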

I want to speed up im2col with matrix multiplication, for example AXB, where X is the input and A and B are auxiliary matrices. I hope you can give me some suggestions.

You can find the CPU implementation here and the CUDA implementation here.
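If the goal is a NumPy version without Python loops, one option is to precompute index arrays and gather all windows in a single indexing call. The sketch below is an assumption about how one might do this, not a port of the C or CUDA code: `_window_indices`, `unfold` and `fold` are hypothetical helper names, the input is assumed to be (N, C, H, W), and the layout is only meant to mirror torch.nn.functional.unfold and fold.

```python
import numpy as np

def _window_indices(size, k, d, p, s):
    """Index matrix of shape (k, out) into one padded spatial axis."""
    out = (size + 2 * p - d * (k - 1) - 1) // s + 1
    return d * np.arange(k)[:, None] + s * np.arange(out)[None, :]

def unfold(x, kernel_size, dilation=(1, 1), padding=(0, 0), stride=(1, 1)):
    """Gather-based im2col sketch: (N, C, H, W) -> (N, C*kh*kw, L)."""
    n, c, h, w = x.shape
    (kh, kw), (dh, dw), (ph, pw), (sh, sw) = kernel_size, dilation, padding, stride
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    rows = _window_indices(h, kh, dh, ph, sh)  # (kh, out_h)
    cols = _window_indices(w, kw, dw, pw, sw)  # (kw, out_w)
    # The two index arrays broadcast to (kh, kw, out_h, out_w),
    # so every window is gathered in one fancy-indexing call.
    patches = xp[:, :, rows[:, None, :, None], cols[None, :, None, :]]
    return patches.reshape(n, c * kh * kw, -1)

def fold(cols, output_size, kernel_size, dilation=(1, 1), padding=(0, 0), stride=(1, 1)):
    """col2im sketch: (N, C*kh*kw, L) -> (N, C, H, W), summing overlapping values."""
    h, w = output_size
    (kh, kw), (dh, dw), (ph, pw), (sh, sw) = kernel_size, dilation, padding, stride
    n = cols.shape[0]
    c = cols.shape[1] // (kh * kw)
    rows = _window_indices(h, kh, dh, ph, sh)
    colsi = _window_indices(w, kw, dw, pw, sw)
    patches = cols.reshape(n, c, kh, kw, rows.shape[1], colsi.shape[1])
    out = np.zeros((n, c, h + 2 * ph, w + 2 * pw), dtype=cols.dtype)
    # np.add.at accumulates repeated indices; a plain `out[idx] += patches`
    # would keep only one contribution where windows overlap.
    idx = (slice(None), slice(None), rows[:, None, :, None], colsi[None, :, None, :])
    np.add.at(out, idx, patches)
    # Drop the padding border again.
    return out[:, :, ph:ph + h, pw:pw + w]
```

On the AXB idea: unfold only copies input elements, so it is a linear map and could in principle be written as a 0/1 selection matrix applied to the flattened input, with fold as its transpose. The index-based gather above does the same thing without building that (mostly zero) matrix. With non-overlapping windows, for example a 2x2 kernel with stride 2 on a 4x4 input, `fold(unfold(x, ...), ...)` returns `x` again; with overlapping windows each pixel is multiplied by the number of windows that cover it.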