I have been working on what I think should be a ridiculously simple problem. Let’s say I have a 2D 5x5 matrix. I would like to take each 3x3 patch, flatten the patch (to 9 elements), then put the flattened patch through a network that spits out a single number. So the 5x5 matrix should become a 3x3 matrix, where each number in the resulting 3x3 matrix came from running one 3x3 patch through the network. I’m also fine with a 5x5 result if adding padding simplifies the solution. The solutions I am coming up with create patches (using unfold) which I then have to convert and concat back together, making sure the spatial relationship between the input and output matrix is preserved. What am I missing here?

What sort of network are you applying to the 3x3 patches? What you’re describing is effectively 2D convolution, but instead of a linear/affine operator applied to the patch you’re applying a network operator.

I’m flattening the 3x3 patches to 9 elements and then running them through a linear layer that produces a single value.

If each patch is using the same linear layer, then this is exactly a 2D convolution.
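To make the equivalence concrete, here is a small sketch (not from the thread) showing that a shared `nn.Linear(9, 1)` applied to flattened 3x3 patches gives the same result as an `nn.Conv2d` with a 3x3 kernel, once the weights are copied over:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image = torch.randn(1, 1, 5, 5)

lin = nn.Linear(9, 1)
conv = nn.Conv2d(1, 1, kernel_size=3)
with torch.no_grad():
    # reshape the linear weights into conv layout (out_ch, in_ch, kh, kw)
    conv.weight.copy_(lin.weight.view(1, 1, 3, 3))
    conv.bias.copy_(lin.bias)

# extract 3x3 patches, flatten each to 9 elements, apply the linear layer
patches = image.unfold(2, 3, 1).unfold(3, 3, 1).contiguous().view(-1, 9)
manual = lin(patches).view(1, 1, 3, 3)

assert torch.allclose(manual, conv(image), atol=1e-6)
```

The assertion passes because both operations compute the same dot product between the 9 kernel weights and each 3x3 patch, sliding over the image with stride 1.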

Are you aware of any code/blog/whatever that shows an example of implementation then?

You are not missing anything and your explanation seems fine.

As @bdh said, it’s comparable to a conv or pooling layer.

Here is a small example for your use case:

```
import torch
import torch.nn as nn

image = torch.randn(1, 1, 5, 5)
kh, kw = 3, 3  # kernel size
dh, dw = 1, 1  # stride

# unfold dim 2 (height) and dim 3 (width) into 3x3 patches
patches = image.unfold(2, kh, dh).unfold(3, kw, dw)  # (1, 1, 3, 3, kh, kw)
patches = patches.contiguous().view(-1, kh * kw)     # (9, 9): one row per patch

lin = nn.Linear(kh * kw, 1)
output = lin(patches)             # (9, 1): one value per patch
output = output.view(1, 1, 3, 3)  # restore the spatial layout
```
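For completeness, the same result can also be obtained with `torch.nn.functional.unfold`, which handles the patch extraction and flattening in one call (a sketch under the same setup as above, not from the thread):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image = torch.randn(1, 1, 5, 5)
lin = nn.Linear(9, 1)

# F.unfold returns (N, C*kh*kw, L), where L is the number of patches
cols = F.unfold(image, kernel_size=3)           # (1, 9, 9)
out = lin(cols.transpose(1, 2))                 # (1, 9, 1): linear over each patch
out = out.transpose(1, 2).view(1, 1, 3, 3)      # back to spatial layout
```

Both `F.unfold` and the double `Tensor.unfold` flatten each patch in row-major order and enumerate patch positions left-to-right, top-to-bottom, so the two approaches produce identical outputs.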


Cool! Definitely a lot simpler than the stuff I was coming up with. Thanks!