In my current application I would like to perform max pooling with a diamond kernel. This is possible by performing a 3x1 and a 1x3 max pool, one along each axis, and then taking the element-wise max of the two results, but this is wasteful because it processes the image twice.
I would ideally like to pool only once, with a diamond kernel. I don’t think the max pooling operation supports a custom kernel as input, but I would ideally like to be able to do something like this:
pool = nn.functional.max_pool2d(x, kernel=[[0,1,0],[1,1,1],[0,1,0]], stride=1, padding=1)
I tried to mimic this behaviour by convolving with the same kernel ([[0,1,0],[1,1,1],[0,1,0]]), but a convolution reduces the elements under the kernel with a sum, and I’m not sure how to make it reduce with max instead. Is it even possible? I also need the solution to work with 3D images, and it seems unfold only supports 2D images.
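For reference, the two-pool workaround described above can be written as follows (a minimal sketch; variable names are my own). Because max_pool2d uses implicit negative-infinity padding, the element-wise max of the two pools is exactly a diamond-kernel max pool, borders included:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)  # (N, C, H, W)

# 3x1 pool covers {up, center, down}; 1x3 pool covers {left, center, right}.
# Their union is the five-element diamond.
vert = F.max_pool2d(x, kernel_size=(3, 1), stride=1, padding=(1, 0))
horz = F.max_pool2d(x, kernel_size=(1, 3), stride=1, padding=(0, 1))
diamond = torch.maximum(vert, horz)
```

The downside, as noted, is that the image is traversed twice.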
I don’t think that pytorch supports what you want with a single operation.
As an alternative to max-pooling along the two axes and then computing the
element-wise max of the two results, you could construct a Conv2d with five out_channels, where each channel of its kernel is just a 3x3 slice with a single 1.0 corresponding to each of the five 1.0s that appear in your “diamond kernel.”
Then perform a max-pool along the channel dimension of your convolved image.
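A sketch of this approach (my own code, not a tested implementation): each of the five output channels of the Conv2d extracts one of the five diamond positions, and the channel-wise max then reduces them. One subtlety is padding: Conv2d pads with zeros, whereas max-pooling effectively pads with negative infinity, so here the input is padded with a very negative finite value instead (a literal -inf would produce NaNs when multiplied by the zero kernel taps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Five out_channels, each kernel a 3x3 slice with a single 1.0 at one
# of the five positions of the diamond.
conv = nn.Conv2d(1, 5, kernel_size=3, padding=0, bias=False)
offsets = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]  # the five 1.0s
with torch.no_grad():
    conv.weight.zero_()
    for out_ch, (i, j) in enumerate(offsets):
        conv.weight[out_ch, 0, i, j] = 1.0

x = torch.randn(1, 1, 8, 8)
# Pad with a large negative value so border maxima ignore out-of-bounds taps.
xp = F.pad(x, (1, 1, 1, 1), value=-1e30)
diamond, _ = conv(xp).max(dim=1, keepdim=True)
```

With that padding choice the result matches the two-pool baseline exactly.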
I don’t have any intuition about which approach would be faster, but I would
expect a single “diamond-kernel” max-pool with a well-tuned custom implementation
(at the c++ or cuda level) to be faster than either of the two multi-step approaches.
But writing such a custom implementation would be a project in its own right.
Having said that, as long as you’re not looping (in python) over the pixels in your images,
pytorch should be quite fast in applying the multi-step max-pools and/or convolutions.
You might try timing both, and I would expect that one (or both) will prove fast enough
for your use case.
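A rough timing harness along those lines might look like this (purely illustrative; sizes and iteration counts are arbitrary, and you would time the convolution-based variant the same way):

```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(8, 1, 256, 256)  # a batch of images

def two_pools(x):
    # The two-axis max-pool approach combined with an element-wise max.
    vert = F.max_pool2d(x, kernel_size=(3, 1), stride=1, padding=(1, 0))
    horz = F.max_pool2d(x, kernel_size=(1, 3), stride=1, padding=(0, 1))
    return torch.maximum(vert, horz)

def timeit(fn, n=20):
    fn(x)  # warm-up run
    t0 = time.perf_counter()
    for _ in range(n):
        fn(x)
    return (time.perf_counter() - t0) / n

print(f"two max-pools: {timeit(two_pools) * 1e3:.3f} ms per call")
```

On a GPU you would additionally need torch.cuda.synchronize() around the timed region, since CUDA kernels launch asynchronously.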