Understanding Transposed Convolution

I am trying to understand an example snippet that makes use of the PyTorch transposed convolution function, nn.ConvTranspose2d, whose documentation states:

“The padding argument effectively adds dilation * (kernel_size - 1) - padding amount of zero padding to both sizes of the input.”

Consider the snippet below, where a [1, 1, 4, 4] sample image of all ones is fed into a ConvTranspose2d operation with arguments stride=2 and padding=1, using a weight matrix of shape [1, 1, 4, 4] whose entries run from 1 to 16 (in this case dilation=1, so added_padding = 1*(4-1)-1 = 2):

import torch
import torch.nn as nn
sample_im = torch.ones(1, 1, 4, 4).cuda()  # [1, 1, 4, 4] image of all ones
sample_deconv2 = nn.ConvTranspose2d(1, 1, 4, 2, 1, bias=False).cuda()  # kernel_size=4, stride=2, padding=1
sample_deconv2.weight = torch.nn.Parameter(torch.tensor([[[[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.], [13., 14., 15., 16.]]]]).cuda())
x = sample_deconv2(sample_im)
print(x)

Which yields:

tensor([[[[ 6., 12., 14., 12., 14., 12., 14.,  7.],
          [12., 24., 28., 24., 28., 24., 28., 14.],
          [20., 40., 44., 40., 44., 40., 44., 22.],
          [12., 24., 28., 24., 28., 24., 28., 14.],
          [20., 40., 44., 40., 44., 40., 44., 22.],
          [12., 24., 28., 24., 28., 24., 28., 14.],
          [20., 40., 44., 40., 44., 40., 44., 22.],
          [10., 20., 22., 20., 22., 20., 22., 11.]]]], device='cuda:0',
       grad_fn=<CudnnConvolutionTransposeBackward>)
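
For reference, the 8x8 output size follows directly from the shape formula in the ConvTranspose2d docs (with output_padding=0):

H_out = (H_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + 1
      = (4 - 1) * 2 - 2 * 1 + 1 * (4 - 1) + 1
      = 8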

Now I have seen simple examples of transposed convolution without stride and padding, for instance if the input is a 2x2 image [[2, 4], [0, 1]], and the convolutional filter with one output channel is [[3, 1], [1, 5]], then the resulting tensor of shape [1, 1, 3, 3] can be seen as the sum of the rightmost 4 matrices in the image below:
[image: the 3x3 transposed convolution output shown as the sum of four shifted copies of the kernel, each scaled by one input pixel]
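
As a sanity check on that picture, here is a minimal sketch (values from the example above, computed via torch.nn.functional.conv_transpose2d with stride=1 and padding=0) that builds the same 3x3 result both ways:

import torch
import torch.nn.functional as F

x = torch.tensor([[2., 4.], [0., 1.]])  # 2x2 input
w = torch.tensor([[3., 1.], [1., 5.]])  # 2x2 kernel

# The operator's answer.
via_op = F.conv_transpose2d(x.reshape(1, 1, 2, 2), w.reshape(1, 1, 2, 2))[0, 0]

# The picture's answer: each input pixel pastes a scaled copy of the
# kernel into the output, offset by the pixel's position; overlaps sum.
via_sum = torch.zeros(3, 3)
for i in range(2):
    for j in range(2):
        via_sum[i:i + 2, j:j + 2] += x[i, j] * w

print(torch.allclose(via_op, via_sum))  # True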

The problem is that I can’t seem to find examples that use striding or padding in the same visualization. Going by my snippet, I am having a very difficult time understanding how the padding is applied to the sample image, or how the stride works to produce this output. Any insights are appreciated; even just understanding how the 6 in the (0,0) entry or the 12 in the (0,1) entry of the resulting matrix is computed would be very helpful.

One authoritative way to understand transposed convolutions is to look at how convolutions operate as banded matrices on flattened images; the transposed convolution is then just the transposed matrix applied to something of the output shape. For example, Dumoulin and Visin do this in their famous explanation, “A guide to convolution arithmetic for deep learning”.
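
To make that concrete, here is a minimal sketch with the shapes from your snippet: the forward convolution’s matrix is built column by column by pushing basis vectors through F.conv2d, and conv_transpose2d with the same weight, stride and padding turns out to multiply by its transpose.

import torch
import torch.nn.functional as F

w = torch.arange(1., 17.).reshape(1, 1, 4, 4)  # the 1..16 kernel from the question

# Matrix of the forward convolution (8x8 -> 4x4, stride=2, padding=1),
# built by applying it to the standard basis of flattened 8x8 images.
conv_mat = torch.zeros(16, 64)
for i in range(64):
    basis = torch.zeros(64)
    basis[i] = 1.
    conv_mat[:, i] = F.conv2d(basis.reshape(1, 1, 8, 8), w, stride=2, padding=1).reshape(-1)

# Applying the transposed matrix to a flattened 4x4 input reproduces
# the transposed convolution.
x = torch.ones(1, 1, 4, 4)
via_matrix = (conv_mat.t() @ x.reshape(-1)).reshape(1, 1, 8, 8)
via_op = F.conv_transpose2d(x, w, stride=2, padding=1)
print(torch.allclose(via_matrix, via_op))  # True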

The other thing you can do is recall that transposed convolutions exist to provide the adjoint operation of convolution, as needed for computing the derivative.
The adjoint of summation is expansion: a single input value of the transposed convolution, corresponding to one output of the forward convolution, is re-used at every output location of the transposed convolution that the forward convolution summed over to form that output.
The adjoint of expansion, i.e. using the same input value several times, is summation: if a given pixel is used in several forward outputs, the adjoint sums over the corresponding locations.
In between, you multiply by the corresponding weight in the summation stencil.
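
Here is a quick numerical check of that adjoint property, using the shapes and arguments from your snippet (random tensors purely for illustration):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
w = torch.arange(1., 17.).reshape(1, 1, 4, 4)
a = torch.randn(1, 1, 8, 8)  # shaped like the forward convolution's input
b = torch.randn(1, 1, 4, 4)  # shaped like the forward convolution's output

# <conv2d(a, w), b> == <a, conv_transpose2d(b, w)> for matching stride/padding.
lhs = (F.conv2d(a, w, stride=2, padding=1) * b).sum()
rhs = (a * F.conv_transpose2d(b, w, stride=2, padding=1)).sum()
print(torch.allclose(lhs, rhs))  # True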

In your case: because of the stride and the padding, the top-left input pixel of the forward convolution is used only in the top-left output window, multiplied by the (1, 1) weight element, which is 6. Similarly, the (0, 1) entry is 12 = 5 + 7, because that pixel shows up in two forward windows (the top-left one and the one directly to its right; the stride makes it appear twice). The (1, 0) pixel gives 12 = 2 + 10 by the same logic.
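
Spelled out as code, this scatter-add view reproduces your whole output; conv_transpose2d_naive below is just an illustrative hand-rolled helper (single channel, dilation=1), not a PyTorch API:

import torch

def conv_transpose2d_naive(x, w, stride, padding):
    # Every input pixel (i, j) pastes value * kernel into the output,
    # anchored at (i*stride - padding, j*stride - padding); overlaps sum.
    h_in, w_in = x.shape
    kh, kw = w.shape
    h_out = (h_in - 1) * stride - 2 * padding + kh
    w_out = (w_in - 1) * stride - 2 * padding + kw
    out = torch.zeros(h_out, w_out)
    for i in range(h_in):
        for j in range(w_in):
            for ki in range(kh):
                for kj in range(kw):
                    oi, oj = i * stride - padding + ki, j * stride - padding + kj
                    if 0 <= oi < h_out and 0 <= oj < w_out:
                        out[oi, oj] += x[i, j] * w[ki, kj]
    return out

x = torch.ones(4, 4)
w = torch.arange(1., 17.).reshape(4, 4)
out = conv_transpose2d_naive(x, w, stride=2, padding=1)
print(out[0, 0], out[0, 1], out[1, 0])  # tensor(6.) tensor(12.) tensor(12.)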

Best regards

Thomas
