AlexNet Input Size: 224 or 227?

zhanwenchen · March 30, 2019, 4:42pm

Looking at the source code for vision/AlexNet, I can’t tell whether the implementation assumes a 227 or a 224 input size. With 227, the output of the first conv layer would be (227 - 11 + 2x2) / 4 + 1 = 56 exactly. That does not fit the first pooling layer, which would then produce a non-integer output size of (56 - 3) / 2 + 1 = 27.5.

One possible reading is that the pool1 output is then floored, but this differs from this Medium article, which suggests that conv1’s input may actually be 224 and that it is the conv1 output that gets floored: (224 - 11 + 2x2) / 4 + 1 = 55.25, floored to 55, so that pool1 produces an integer without flooring: (55 - 3) / 2 + 1 = 27. Can someone please state definitively whether the input should be 224 or 227? This matters for users who need to resize their images correctly.
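The arithmetic above can be checked with a short sketch. It assumes the layer parameters quoted in this post (conv1: kernel 11, stride 4, padding 2; pool1: kernel 3, stride 2) and floors the result, which is how PyTorch's `Conv2d` and the default `MaxPool2d` (with `ceil_mode=False`) compute output sizes. The helper name `out_size` is made up for illustration.

```python
def out_size(n, kernel, stride, padding=0):
    """Spatial output size of a conv/pool layer, with the division floored."""
    return (n + 2 * padding - kernel) // stride + 1

for n in (224, 227):
    # conv1 as quoted in the post: kernel 11, stride 4, padding 2
    conv1 = out_size(n, kernel=11, stride=4, padding=2)
    # pool1 as quoted in the post: kernel 3, stride 2, no padding
    pool1 = out_size(conv1, kernel=3, stride=2)
    print(f"input {n}: conv1 -> {conv1}, pool1 -> {pool1}")
```

Under these assumptions both input sizes end up at 27 after pool1; they differ only in which layer does the flooring (conv1 for 224, pool1 for 227).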

ptrblck · March 30, 2019, 8:34pm

As described in the docs, the input size should be at least 224x224, and as far as I know this size was used for training, as shown here.

All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224.

mattgoh · May 11, 2022, 5:45pm

There are several sources that point out the same discrepancy. The video transcript here (I believe from Andrew Ng) states:

So, AlexNet input starts with 227 by 227 by 3 images. And if you read the paper, the paper refers to 224 by 224 by 3 images. But if you look at the numbers, I think that the numbers make sense only if it is actually 227 by 227.

In the Caffe file deploy.prototxt, the input dim is indicated to be 227x227:

layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }
}
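A quick way to see why 227 "makes sense" in that setting: the sketch below assumes conv1 has kernel 11, stride 4 and no padding, which is my understanding of the paper and the Caffe reference model but is not stated in this thread. The helper `exact_out_size` is a hypothetical name; it keeps the division exact so that a non-integer result is visible.

```python
from fractions import Fraction

def exact_out_size(n, kernel, stride, padding=0):
    """Output size without flooring, so fractional results stay visible."""
    return Fraction(n + 2 * padding - kernel, stride) + 1

for n in (224, 227):
    # Assumption: conv1 with kernel 11, stride 4, padding 0
    size = exact_out_size(n, kernel=11, stride=4)
    is_integer = size.denominator == 1
    print(f"input {n}: conv1 output = {float(size)} (integer: {is_integer})")
```

With no padding, 227 gives exactly 55, while 224 gives 54.25. The torchvision model discussed earlier in the thread uses padding 2 on conv1, which is why 224 also lands on 55 there once floored.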

There are other sources (from searching the web) that reference this discrepancy, and work through the calculations to support the claim.

As a side note, the AlexNet paper states that random cropping was used, whereas the PyTorch docs here use CenterCrop as an example.
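To make the difference concrete, here is a library-free sketch of how the two crop windows are chosen. The function names are hypothetical, and the 256x256 source size is an assumption for illustration; this is not the torchvision implementation.

```python
import random

def center_crop_box(height, width, size):
    """Return (top, left, bottom, right) of a centered size x size window."""
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size

def random_crop_box(height, width, size, rng=random):
    """Return (top, left, bottom, right) of a randomly placed size x size window."""
    top = rng.randint(0, height - size)    # randint includes both endpoints
    left = rng.randint(0, width - size)
    return top, left, top + size, left + size

# Assumed 256x256 image cropped to 224x224
print(center_crop_box(256, 256, 224))   # always the same window
print(random_crop_box(256, 256, 224))   # varies from call to call
```

A center crop is deterministic, which suits evaluation; a random crop yields a different window on each call, which is why it is used as augmentation during training.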