MaxPool2D Bug in Backward Pass

Hi there! I think there might be a bug in the MaxPool2d backward pass.
I have m1 = F.max_pool2d(r1, kernel_size=3, stride=2).
I expect r1.grad to be filled with the values from m1.grad, but there is one value that comes from nowhere (highlighted in the screenshot).

This is expected: with overlapping windows (kernel_size=3 and stride=2), the same input element can be the maximum of more than one window, and the backward pass accumulates the gradient from every window it wins. On a 7x7 input the windows start at rows/columns 0, 2, and 4, so adjacent windows share one row or column.
E.g., take a look at this random example using your shapes:

import torch
import torch.nn.functional as F

r1 = torch.randn(1, 3, 7, 7, requires_grad=True)
print(r1)
# tensor([[[[-1.4727,  0.9768, -0.3311,  0.8737,  0.0744, -2.0515,  0.7500],
#           [ 1.3699,  1.1092,  0.1827,  0.6149, -1.2941, -0.8750, -0.8603],
#           [-0.0599, -0.6267,  0.7734, -0.3923, -0.9799,  0.6560, -0.9698],
#           [-1.7298, -2.7498,  0.0206,  2.0452,  0.3059, -1.3612, -0.2122],
#           [ 0.6241,  3.4831, -0.0442,  1.9328, -0.3648,  2.7771, -0.1683],
#           [ 0.0857, -0.2046,  2.1729, -1.7988, -2.1140, -0.4294,  1.2538],
#           [-0.4254, -2.8631,  0.6067,  1.1106, -1.0001, -0.3954,  0.6482]],

#          [[-0.3228,  1.4671, -0.1583, -0.0143,  0.6897, -0.1155, -1.1501],
#           [-0.9122,  0.7400,  1.4444,  1.5633,  0.8384,  0.2486, -1.1106],
#           [ 2.1963, -0.6138, -1.1844,  1.2009,  0.7253, -0.4287,  0.6641],
#           [-1.6240,  1.1937, -0.5235,  2.1328,  0.8682, -1.3029,  0.6951],
#           [-0.9053, -0.2419, -0.3201, -0.9616,  0.8365, -1.9191,  0.9880],
#           [ 1.7853,  1.4329, -0.2436, -0.9494, -0.5477, -0.0317,  1.4060],
#           [ 0.2019,  1.8628, -0.6363,  0.6255,  0.0776, -0.2312,  0.3852]],

#          [[ 0.5682, -1.0610, -0.3535, -0.4028, -3.1006,  0.8192,  0.8052],
#           [-0.2114, -1.0538,  1.0613, -0.2747,  0.2004,  0.2594,  1.9385],
#           [ 0.3455,  0.9690, -1.2378,  0.6012, -0.1848,  1.8598, -0.8459],
#           [ 0.2934, -0.1465,  0.0372, -0.4022, -0.1241, -1.5513,  0.5390],
#           [ 0.0278, -0.9802,  0.9465,  1.6141, -1.7087, -1.1682, -0.5648],
#           [ 1.5077,  0.6494, -0.3490, -0.1972,  0.3168, -1.1350, -0.1763],
#           [-2.6766,  0.8964,  0.5659, -0.2317,  0.5457,  0.6722,  1.3882]]]],
#        requires_grad=True)
m1 = F.max_pool2d(r1, kernel_size=3, stride=2)
print(m1)
# tensor([[[[1.3699, 0.8737, 0.7500],
#           [3.4831, 2.0452, 2.7771],
#           [3.4831, 2.1729, 2.7771]],

#          [[2.1963, 1.5633, 0.8384],
#           [2.1963, 2.1328, 0.9880],
#           [1.8628, 0.8365, 1.4060]],

#          [[1.0613, 1.0613, 1.9385],
#           [0.9690, 1.6141, 1.8598],
#           [1.5077, 1.6141, 1.3882]]]], grad_fn=<MaxPool2DWithIndicesBackward0>)
m1.retain_grad()  # keep the gradient of the non-leaf tensor m1 so we can inspect it
m1.mean().backward()
print(m1.grad)
# tensor([[[[0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370]],

#          [[0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370]],

#          [[0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370],
#           [0.0370, 0.0370, 0.0370]]]])
print(r1.grad)
# tensor([[[[0.0000, 0.0000, 0.0000, 0.0370, 0.0000, 0.0000, 0.0370],
#           [0.0370, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0370, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0741, 0.0000, 0.0000, 0.0000, 0.0741, 0.0000],
#           [0.0000, 0.0000, 0.0370, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

#          [[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0370, 0.0370, 0.0000, 0.0000],
#           [0.0741, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0370, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0370, 0.0000, 0.0370],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0370],
#           [0.0000, 0.0370, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

#          [[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0741, 0.0000, 0.0000, 0.0000, 0.0370],
#           [0.0000, 0.0370, 0.0000, 0.0000, 0.0000, 0.0370, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0741, 0.0000, 0.0000, 0.0000],
#           [0.0370, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
#           [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0370]]]])

Now look for duplicates in the output m1: e.g., 3.4831 appears twice in the first channel because it is picked as the max value by two overlapping windows. This value lives at r1[0, 0, 4, 1], and you can see that r1.grad[0, 0, 4, 1] accumulated both contributions, i.e. it holds 0.0741 = 2/27 instead of 0.0370 = 1/27.
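
If you don't want to eyeball the printed tensors, max_pool2d can also return the argmax indices via return_indices=True. Here is a minimal sketch (the seed is arbitrary, just for reproducibility) that counts how many windows each input element wins and checks that its gradient is that count times 1/27:

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
r1 = torch.randn(1, 3, 7, 7, requires_grad=True)

# return_indices=True additionally returns, per channel, the flattened input
# index (row * 7 + col) that each 3x3 window selected as its max
m1, idx = F.max_pool2d(r1, kernel_size=3, stride=2, return_indices=True)
m1.mean().backward()  # every element of the 3x3x3 output gets grad 1/27

flat_grad = r1.grad.flatten(2)  # (1, 3, 49): flatten the 7x7 grid per channel
for c in range(idx.size(1)):
    winners, counts = idx[0, c].flatten().unique(return_counts=True)
    for w, k in zip(winners.tolist(), counts.tolist()):
        # an element that wins k windows accumulates k * 1/27 of gradient
        assert torch.isclose(flat_grad[0, c, w], torch.tensor(k / m1.numel()))
    print(f"channel {c}: indices picked by two windows -> {winners[counts > 1].tolist()}")

Every index printed there corresponds to a doubled entry (0.0741) in r1.grad. If you set the stride equal to the kernel size, the windows no longer overlap, no input element can win twice, and the doubling disappears.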

Got it! Thank you very much for your help.