Implicit asymmetric padding in Conv2d?


I have been banging my head against this for an hour already. In the original ResNet paper they say that the first layer of all residual networks is a convolutional layer with filters of size 7x7 and stride=2. To make that work, torchvision implements it with a padding of 3 pixels:

self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, 
                                      stride=2, padding=3, bias=False)

Now, according to the standard formula for input/output sizes in conv2d layers, if the input is the standard 224x224 ImageNet size, one should get out of the above operation an output of size (224-7+2x3)/2 + 1. The point is that the numerator in this formula is odd, so the division does not give an integer. So how is it that this does not even crash?

Summarizing, here is the question. Why does this even work:

import torch

conv1 = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=7, 
                                         stride=2, padding=3)

tensor = torch.ones([1,3,224,224])
out = conv1(tensor)
print(out.shape)  # torch.Size([1, 3, 112, 112])


and returns a tensor of spatial size 112x112? Is Conv2d performing some kind of padding but only to the right and bottom of the input volume?

Many thanks!


In those situations Conv2d discards the last columns/rows of the input, after padding. This is mentioned in the PyTorch documentation for Conv2d, where the output size formula takes the floor of the division by the stride.
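
The rounding can be checked by hand. Below is a minimal sketch in plain Python, assuming dilation 1 and the floor-based output size formula. The helper names `conv_out_size` and `unused_trailing` are made up for this illustration and are not part of PyTorch.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output length along one dimension (dilation 1), using floor division."""
    return (n + 2 * padding - kernel) // stride + 1


def unused_trailing(n, kernel, stride, padding):
    """Number of trailing rows/columns of the padded input that no window covers."""
    return (n + 2 * padding - kernel) % stride


print(conv_out_size(224, 7, 2, 3))    # 112
print(unused_trailing(224, 7, 2, 3))  # 1
```

For the layer in the question, the remainder of 1 is the single padded column/row at the right/bottom edge that the kernel never reaches.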

In the particular case you are mentioning, that is the same as having a right/bottom padding of 2 instead of 3, as the last padded column/row is discarded. Plugging this into the formula that you mentioned for the output dimension of the convolution, we get (224 - 7 + 3 + 2) / 2 + 1 = 112.
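
One way to convince yourself of this equivalence is to pad the input by hand and compare. The following is only a sketch: it reuses the layer from the question, pads asymmetrically with `torch.nn.functional.pad` (3 on the left/top, 2 on the right/bottom), and runs the same weights with no extra padding. The variable names and the random input are chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

conv = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=7,
                       stride=2, padding=3)
x = torch.randn(1, 3, 224, 224)

# Symmetric padding of 3, as in the question.
out_sym = conv(x)

# Manual asymmetric padding. For a 4D tensor, F.pad takes
# (left, right, top, bottom) for the last two dimensions.
x_asym = F.pad(x, (3, 2, 3, 2))
out_asym = F.conv2d(x_asym, conv.weight, conv.bias, stride=2, padding=0)

print(out_sym.shape)
print(out_asym.shape)
print(torch.allclose(out_sym, out_asym, atol=1e-5))
```

If the two outputs match up to floating-point tolerance, the extra padded column/row on the right/bottom contributes nothing to the result.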

That makes sense, thank you!