I have been banging my head against this for an hour now. In the original ResNet paper, the authors say that every residual network starts with a convolutional layer with 7x7 filters and stride=2. In torchvision this is implemented with a padding of 3 pixels:
self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
Now, according to the standard formula for the input/output sizes of conv2d layers, if the input is the standard 224x224 ImageNet size, the above operation should produce an output of spatial size (224 - 7 + 2×3)/2. The point is that the numerator in this formula is odd, so the division is not exact. How is it that this does not even crash?
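For reference, here is the computation written out. I am using the output-size formula from the torch.nn.Conv2d documentation, which (unlike the simplified version above) includes a floor and a trailing +1:

```python
import math

# Output size per the torch.nn.Conv2d docs (dilation = 1):
#   H_out = floor((H_in + 2 * padding - kernel_size) / stride + 1)
h_in, kernel, stride, padding = 224, 7, 2, 3
h_out = math.floor((h_in + 2 * padding - kernel) / stride + 1)
print(h_out)  # -> 112
```

So the docs formula does produce 112, but only because of the floor, which is exactly what I am confused about.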
Summarizing, here is the question. Why does this even work:

import torch

conv1 = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=7, stride=2, padding=3)
tensor = torch.ones([1, 3, 224, 224])
print(conv1(tensor).shape)
and return a tensor of spatial size 112x112? Is Conv2d performing some kind of padding, but only on the right and bottom of the input volume?