Custom MaxPooling

I would like to define a custom layer which works a bit like MaxPooling, but is different in the sense that it doesn’t have a constant kernel size. Let me try to explain through an example.

Given an input of shape (1, 7), I would like to perform MaxPooling, but not with a fixed window size; instead, over a custom set of windows. For example: [0, 1, 2], [1, 2, 3], [2, 4], [3, 5], [4, 5, 6].

It would have been ideal if I could use max as an activation function. In that case, I would have had a weight matrix which looked as follows:

[[1, 1, 1, 0, 0, 0, 0]
 [0, 1, 1, 1, 0, 0, 0]
 [0, 0, 1, 0, 1, 0, 0]
 [0, 0, 0, 1, 0, 1, 0]
 [0, 0, 0, 0, 1, 1, 1]]

This would have resulted in 5 neurons which compute

max(input[0], input[1], input[2])
max(input[1], input[2], input[3])
max(input[2], input[4])
max(input[3], input[5])
max(input[4], input[5], input[6])
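
To make the desired computation concrete, here is a toy sketch with the window indices hard-coded (this is just the forward computation I want, nothing else):

import torch

x = torch.randn(1, 7)                                          # input of shape (1, 7)
windows = [[0, 1, 2], [1, 2, 3], [2, 4], [3, 5], [4, 5, 6]]    # my custom windows

# one output neuron per window: the max over the listed input indices
out = torch.stack([x[0, idx].max() for idx in windows])
print(out.shape)                                               # torch.Size([5])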

However, since max is non-differentiable, I do not know how to implement such a custom layer with max as an activation function.

I would be extremely grateful if anyone could share some pointers regarding this.

Max only backprops through the elements that are the maximum.
Anyway, you have AdaptiveMaxPooling, which lets you choose the output size. That's flexible enough for many use cases, I guess.
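
For reference, a minimal usage sketch (the output size of 5 is an arbitrary choice here):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7)         # (batch, channels, length)
pool = nn.AdaptiveMaxPool1d(5)   # you only choose the output length, the windows are derived for you
print(pool(x).shape)             # torch.Size([1, 1, 5])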

Does it fit your scheme?

No, AdaptiveMaxPooling does not fit my scheme. I would like to explicitly choose/mask the neurons which are considered by each sliding of the kernel.

Perhaps I’m just a beginner with PyTorch, but my current problem stems from the fact that I cannot find how to write an explicit backward() for max_pool2d. I think that if I found a way to do that, I could implement my custom pooling based on it.

As I mentioned above, you don’t really need to write your own custom backward if you can express your op in terms of existing PyTorch ops.
torch.max already provides a proper backward. Obviously, the gradients only flow through the features that are the maximum.
I don’t quite understand what you mean by

Given an input of shape (1, 7), I would like to perform MaxPooling, but not with a fixed window size; instead, over a custom set of windows. For example: [0, 1, 2], [1, 2, 3], [2, 4], [3, 5], [4, 5, 6].

Do you have a kernel size of 3, then 2, then 3 again for a 2D input?
You can slide a kernel over the input using Tensor.unfold.
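
For example, a quick sketch of a fixed-size sliding max built from Tensor.unfold and torch.max (sizes picked arbitrarily), just to show that the backward is already handled:

import torch

x = torch.arange(7.0, requires_grad=True)   # toy 1D input
windows = x.unfold(0, 3, 1)                 # sliding windows of size 3, stride 1 -> shape (5, 3)
out = windows.max(dim=1).values             # max over each window
out.sum().backward()
print(x.grad)                               # gradient reaches only the winning element of each window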

Thanks a lot for following up!

Yes, my kernel not only varies in size, but also operates on non-adjacent neurons.

In essence, I’m trying to do the following: given a layer that takes as input a tensor of shape (3, 10, 10), I flatten it (i.e. the input becomes a tensor of shape (300)), then throw away some neurons because I deem them unnecessary, and then I want to perform the “partial” max pooling on the result. I can use unfold to figure out how I would have slid the kernel over the flattened layer; however, if some of the neurons are no longer present in the layer, I do not know how to do the MaxPooling on the remaining ones.

Maybe my earlier example was a bit too hand-wavy. Let me try to be a little more concrete.

Suppose I have the following input

[[a0, a1, a2]
 [a3, a4, a5]
 [a6, a7, a8]]

MaxPooling with a kernel of size (2,2) will produce the max over the following windows

[[a0, a1]
 [a3, a4]]

[[a1, a2]
 [a4, a5]]

[[a3, a4]
 [a6, a7]]

[[a4, a5]
 [a7, a8]]

Now suppose I had flattened my input

[a0, a1, a2, a3, a4, a5, a6, a7, a8]

Now I can think of the windows as follows

[a0, a1, a3, a4]

[a1, a2, a4, a5]

[a3, a4, a6, a7]

[a4, a5, a7, a8]

Suppose I deemed that the neuron which processes a3 was unnecessary, so I prune it away while keeping everything else intact. Now you can imagine how the MaxPooling would change. The windows would be

[a0, a1, a4] instead of [a0, a1, a3, a4]

[a1, a2, a4, a5] remains the same.

[a4, a6, a7] instead of [a3, a4, a6, a7]

[a4, a5, a7, a8] remains the same.
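
In code, the computation I am after for this pruned toy example would look roughly like this:

import torch

# windows of the 2x2 max pool on the flattened 3x3 input
windows = [[0, 1, 3, 4], [1, 2, 4, 5], [3, 4, 6, 7], [4, 5, 7, 8]]

pruned = {3}                                        # the neuron processing a3 was removed
windows = [[i for i in w if i not in pruned] for w in windows]
print(windows)  # [[0, 1, 4], [1, 2, 4, 5], [4, 6, 7], [4, 5, 7, 8]]

x = torch.randn(9)                                  # the flattened input a0 ... a8
out = torch.stack([x[w].max() for w in windows])    # "partial" max pooling over the surviving indices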

Okay, now I get it.
Please observe the following snippet:

import torch
torch.manual_seed(10)
a = torch.Tensor(list(range(0, 5))).requires_grad_()
b = torch.Tensor(list(range(5, 10))).requires_grad_()

optim = torch.optim.SGD([a, b], lr=1)

def run(c):
    optim.zero_grad()
    c.backward()
    optim.step()  # note the parentheses: optim.step on its own would never apply the update


print(a)
print(a.sum())
print(b)
print(b.sum())

c = torch.max(a.sum(),b.sum())
run(c)
print(a.grad)
print(b.grad)
print("grads doesn't flow through a as b is maximum, this is expected ")
print("Now let's suppose that you want to prune first 2 elements of b")

c = torch.max(a.sum(),b[2:].sum())

run(c)
print(a.grad)
print(b.grad)
print("As you can see gradients doesnt flow through b[0:1]")
print("same effect happens if you multiply the tensor element by 0, as derivative of constant*variable = constant*D_variable")

c = torch.max(a.sum(),(b*torch.Tensor([0,0,1,1,1])).sum())

run(c)
print(a.grad)
print(b.grad)

Output:

tensor([0., 1., 2., 3., 4.], requires_grad=True)
tensor(10., grad_fn=<SumBackward0>)
tensor([5., 6., 7., 8., 9.], requires_grad=True)
tensor(35., grad_fn=<SumBackward0>)
tensor([0., 0., 0., 0., 0.])
tensor([1., 1., 1., 1., 1.])
Gradients don't flow through a because b is the maximum; this is expected
Now let's suppose that you want to prune the first 2 elements of b
tensor([0., 0., 0., 0., 0.])
tensor([0., 0., 1., 1., 1.])
As you can see, gradients don't flow through b[0:2]
The same effect happens if you multiply those elements by 0, since d(constant*variable) = constant*d(variable)
tensor([0., 0., 0., 0., 0.])
tensor([0., 0., 1., 1., 1.])

You just need to play with indexing/masking and torch.max to accomplish whatever you need.
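
For instance, one possible sketch of such a layer, built only from indexing and torch.max so that no custom backward is needed (the window lists are the ones from your first post):

import torch
import torch.nn as nn

class PartialMaxPool(nn.Module):
    # max over an arbitrary, possibly ragged, list of index windows
    def __init__(self, windows):
        super().__init__()
        self.windows = [torch.as_tensor(w) for w in windows]

    def forward(self, x):  # x: (batch, features)
        return torch.stack([x[:, w].max(dim=1).values for w in self.windows], dim=1)

pool = PartialMaxPool([[0, 1, 2], [1, 2, 3], [2, 4], [3, 5], [4, 5, 6]])
x = torch.randn(1, 7, requires_grad=True)
out = pool(x)            # shape (1, 5)
out.sum().backward()
print(x.grad)            # non-zero only at the positions that won a max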

You can use this matrix as a torch.Tensor and perform a channel-wise multiplication. This would force MaxPool to pick only the values at the locations which have a 1 in the mask.
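
A minimal sketch of that idea (the shapes and the masked-out position are made up purely for illustration):

import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 4, 4)                    # non-negative activations, e.g. after a ReLU
mask = torch.ones(1, 1, 4, 4)
mask[0, 0, 1, 1] = 0                          # pretend this position was pruned

out = F.max_pool2d(x * mask, kernel_size=2)   # the zeroed position can no longer win the max
# if the activations can be negative, masking with -inf instead of 0 is the safer choice:
# out = F.max_pool2d(x.masked_fill(mask == 0, float('-inf')), kernel_size=2)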