ResNet kernel_size=1 and stride=2

Dear PyTorch community,

I noticed that the downsample layer used in the ResNet models is a stride-2 convolution. That is fine, but what worries me is that kernel_size is set to… 1!

Either kernel_size=1 or stride=2 would be okay, but together… Doesn’t that skip most of the image? My understanding is that kernel_size=1 and stride=2 looks something like:

xoxo
oooo  ...
xoxo
oooo
  ...

where none of the o’s contribute to the output, so 3/4 of the input is ignored. I know max pooling also discards information, but it at least looks at every value before deciding which ones to discard. This convolution never even reads the pixels in the ‘o’ positions.
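The claim can be checked with a gradient test. The following is a minimal sketch, assuming a standalone Conv2d with the same settings as the ResNet downsample layer and an arbitrary 8x8 input (the input size and variable names are illustrative, not taken from torchvision): an input position is read by the layer only if its gradient is nonzero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same settings as the downsample conv quoted in this thread (assumed standalone here).
conv = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

# Arbitrary 8x8 feature map; requires_grad lets us see which inputs are used.
x = torch.randn(1, 64, 8, 8, requires_grad=True)
out = conv(x)
out.sum().backward()

# A spatial position is "used" if any of its channels received a gradient.
used = (x.grad.abs().sum(dim=1) > 0)[0]  # bool mask of shape (8, 8)
fraction_used = used.float().mean().item()

print(out.shape)      # spatial size is halved: 8x8 -> 4x4
print(used.int())     # 1 at even (row, col) positions, 0 elsewhere
print(fraction_used)  # 0.25
```

Only positions with an even row and an even column index receive a gradient, which matches the x/o pattern drawn above.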

If you don’t believe me, do this:

import torchvision
print(torchvision.models.resnet18())

and you get a big description, including lines like:

(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)

Hi

I think that’s the approach proposed in the original paper, “Deep Residual Learning for Image Recognition”.
Its experiments show that 1x1 convolutions with stride=2 work fine on the shortcut when the feature map size has to be reduced.

In my opinion it makes sense, since the operation is equivalent to two steps applied one after the other:

  • a 2x downsample of the input feature map using ‘nearest’ interpolation, i.e. keeping every second pixel along each spatial dimension
  • a linear projection of each remaining pixel to a higher-dimensional (or lower-dimensional) feature space
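The equivalence can be verified numerically. This is a rough sketch, assuming an even input size and two standalone Conv2d layers that share the same weights (the names conv_strided and conv_dense are made up for this example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# The strided 1x1 conv, and a stride-1 copy that shares its weights.
conv_strided = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
conv_dense = nn.Conv2d(64, 128, kernel_size=1, stride=1, bias=False)
conv_dense.load_state_dict(conv_strided.state_dict())

x = torch.randn(2, 64, 8, 8)  # even spatial size assumed

with torch.no_grad():
    # 1) the layer as used in ResNet
    direct = conv_strided(x)
    # 2) keep every second pixel, then project each pixel linearly
    sliced = conv_dense(x[:, :, ::2, ::2])
    # 3) the same subsampling expressed as 'nearest' interpolation
    interpolated = conv_dense(F.interpolate(x, scale_factor=0.5, mode="nearest"))

same_as_slicing = torch.allclose(direct, sliced, atol=1e-5)
same_as_interpolation = torch.allclose(direct, interpolated, atol=1e-5)
print(same_as_slicing, same_as_interpolation)
```

Both comparisons should come out True up to floating-point tolerance: the strided 1x1 convolution does nothing more than subsample the feature map and then apply the same per-pixel linear map.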

Hope it helps