Stride in conv_transpose1d

My expectation of the output of conv_transpose1d does not match the actual output from PyTorch. In fact, even the dimension of the output does not match my expectation, which makes me wonder: what exactly does stride do?

Here’s an example:

>>> from torch import Tensor, nn
>>> nn.functional.conv_transpose1d(Tensor([[[1,2,3,4]]]), Tensor([[[1,0,0,0]]]), padding=3, stride=2, dilation=1)
tensor([[[0., 3., 0., 4.]]])

With padding=3, I expect the input to be unchanged (i.e. no padding), and the output to be 4*1=4. But the output is [0, 3, 0, 4]. Why?

Your input is a sequence of length 4. Padding of 3 is added to both sides, so that length becomes 10. Your kernel size is 4 and the stride is 2, so the kernel is applied at the following indices:

  1. 0, 1, 2, 3
  2. 2, 3, 4, 5
  3. 4, 5, 6, 7
  4. 6, 7, 8, 9

Thus, 4 outputs.

The documentation for conv_transpose1d says that the padding parameter is not the actual padding applied to the input. The actual padding applied to the input is dilation*(kernel_size-1)-padding, which in this example would be 1*(4-1)-3=0, i.e. no padding. So my expectation is that the input should be unchanged and remain the same dimension as the kernel.

Please see here:

https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html

Yes, the output dimension is consistent with the one shown under Output in the documentation, but not with the rest of the documentation. The actual padding, as I noted in the previous comment, should be zero, hence the input should not be modified and should have the same dimension as the kernel (hence my expectation of a scalar output).

Perhaps I can rephrase my question. The documentation provides no actual equation. What exact equation is being used to compute ConvTranspose1d with the given parameters in my contrived example?

I think you are confusing padding and output_padding. The part of the equation you describe is for output_padding (i.e. applied after the convolution). But your code set padding (i.e. applied before the convolution).

Please see my last post for the output size equation with your numbers inserted.
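
For reference, the output length formula from the linked documentation, with the example's numbers plugged in, gives:

$$
L_{\text{out}} = (L_{\text{in}} - 1)\cdot\text{stride} - 2\cdot\text{padding} + \text{dilation}\cdot(\text{kernel\_size} - 1) + \text{output\_padding} + 1
= (4-1)\cdot 2 - 2\cdot 3 + 1\cdot(4-1) + 0 + 1 = 4
$$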

I think we may be talking about two different "padding"s:

  1. the padding function parameter,
  2. the “padding” actually applied to the input, as you would normally have in, say, a convolution or DFT operation.

The exact excerpt from the documentation of the referenced function regarding padding (the parameter, and not output_padding) reads:

padding: dilation * (kernel_size - 1) - padding zero-padding will be added to both sides of each dimension in the input. Can be a single number or a tuple (padW,). Default: 0

To add to the confusion, the padding parameter behaves differently in the conv_transpose and convolution (PyTorch) functions. In the convolution functions, padding is the actual padding applied. Based on the documentation of the conv_transpose function, padding (the parameter) is awkwardly not actually padding (as you referenced in your first response); the actual padding is given by the equation. In my example, dilation=1, kernel_size=4 and padding=3. So the actual zero-“padding” size of the input should be (based on the documentation) InputPadding = 1*(4-1) - 3 = 0, i.e. no padding before the convolution. output_padding is implicitly 0, hence no padding after the convolution.

This means that the arrays being operated on should be [1,2,3,4] and [1,0,0,0] (unchanged), hence my expectation that the output be the scalar 4*1 = 4.

It seems that the documentation is completely wrong and the implementation is doing something very different. Unfortunately, looking up the actual implementation of the function seems to require building ATen, and I haven't gotten around to doing that yet.

In the nn.ConvTranspose1d documentation, it states:

The padding argument effectively adds dilation * (kernel_size - 1) - padding amount of zero padding to both sides of the input. This is set so that when a Conv1d and a ConvTranspose1d are initialized with same parameters, they are inverses of each other in regard to the input and output shapes. However, when stride > 1, Conv1d maps multiple input shapes to the same output shape. output_padding is provided to resolve this ambiguity by effectively increasing the calculated output shape on one side. Note that output_padding is only used to find output shape, but does not actually add zero-padding to output.

(emphasis mine)

If you set the stride to 1 while maintaining padding = 3, you get an output size of 1.
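
For example, with the tensors from the first post (a quick check on my end, so the exact formatting may differ):

>>> nn.functional.conv_transpose1d(Tensor([[[1,2,3,4]]]), Tensor([[[1,0,0,0]]]), padding=3, stride=1, dilation=1)
tensor([[[4.]]])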

Please disregard my initial reply. I was confusing it with Conv1d operations. In this case, you have the reverse operation. Working in reverse, if you started with an output size of 4, and added padding of 3 on each side, you have a size of 10. Then the operation would be performed on that sequence of size 10 as:

  1. 0, 1, 2, 3
  2. 2, 3, 4, 5
  3. 4, 5, 6, 7
  4. 6, 7, 8, 9

And so you get an input size of 4, as mentioned earlier.
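
One quick sanity check of that shape relationship, using the same tensors: a Conv1d with the same parameters maps a length-4 input back to a length-4 output.

>>> nn.functional.conv1d(Tensor([[[1,2,3,4]]]), Tensor([[[1,0,0,0]]]), padding=3, stride=2, dilation=1).shape
torch.Size([1, 1, 4])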

But if you set stride to 1 and padding to 3, and start with an output size of 1, adding 3 padding to both sides makes a size of 7. Then you have:

  1. 0, 1, 2, 3
  2. 1, 2, 3, 4
  3. 2, 3, 4, 5
  4. 3, 4, 5, 6

And so you get an input size of 4.

In this way, you can think of “padding” as the reverse operation, thus cutting off that much of the end result.
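
For example, with the tensors from the first post, the padding=3 result should be just the unpadded result with 3 entries cropped from each end (again a quick check, so the printed formatting may differ):

>>> full = nn.functional.conv_transpose1d(Tensor([[[1,2,3,4]]]), Tensor([[[1,0,0,0]]]), stride=2)
>>> full
tensor([[[1., 0., 2., 0., 3., 0., 4., 0., 0., 0.]]])
>>> full[..., 3:-3]
tensor([[[0., 3., 0., 4.]]])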

Thanks for the response. That gave me a better understanding of how to view the “intention” of the function. I was still having issues replicating the outputs. So I dug into it more to try to “reverse engineer” the outputs with different parameters. It seems that the equation that represents the function behavior, for s > 1, can be described as (excluding other parameters):

$$
\tilde{X}_i :=
\begin{cases}
X[\lfloor i/s \rfloor] & \text{if } i \bmod s = 0 \\
0 & \text{otherwise}
\end{cases}
$$

where i is in [0, s*(N-1)+1), N being the input length. Then the function output becomes the non-circular discrete convolution one may expect, something along the lines of:

$$
C_{i} = \sum_j \tilde{X}_{i+j} \, K^{T}_{j}
$$

So with s > 1, what happens to the input is not the padding one may expect in a convolution operation, but rather an interleaving of zeros between its elements.

I wish they had actually specified the equation in the documentation instead of the “verbal” description, which is insufficient, vague, and misleading.
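
Here is a small sketch of that interpretation (my own reconstruction, single-channel only, and assuming the effective padding dilation*(kernel_size-1)-padding is non-negative): zero-stuff the input, flip the kernel, and run a plain stride-1 conv1d. On my end it reproduces the output of the original example.

import torch
from torch import nn

x = torch.tensor([[[1., 2., 3., 4.]]])
w = torch.tensor([[[1., 0., 0., 0.]]])
stride, padding, dilation = 2, 3, 1
k = w.shape[-1]

# Interleave (stride - 1) zeros between the input elements: [1, 0, 2, 0, 3, 0, 4]
x_up = torch.zeros(1, 1, stride * (x.shape[-1] - 1) + 1)
x_up[..., ::stride] = x

# "Effective" padding from the docs: dilation * (kernel_size - 1) - padding = 0 here
eff_pad = dilation * (k - 1) - padding

# Plain stride-1 conv1d on the zero-stuffed input with the flipped kernel
out = nn.functional.conv1d(x_up, w.flip([2]), padding=eff_pad, dilation=dilation)

print(out)  # tensor([[[0., 3., 0., 4.]]])
print(nn.functional.conv_transpose1d(x, w, padding=padding, stride=stride, dilation=dilation))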

Here are some visuals as to what is occurring in the PyTorch implementation, including when stride > 1: