[ImageNet] A question about data augmentation

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.
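For concreteness, the train-time part of the quoted scheme can be sketched as follows (a minimal NumPy sketch under my own assumptions, not the paper's actual pipeline; the `image` array here is random placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, size=224):
    """Sample one training patch: a random size x size crop,
    mirrored horizontally with probability 1/2."""
    h, w, _ = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

image = rng.random((256, 256, 3))  # placeholder for one 256 x 256 RGB image
patch = random_crop_and_flip(image)
print(patch.shape)  # (224, 224, 3)

# The paper's "factor of 2048" is 32 x 32 crop offsets times 2 reflections
# (the exact count of distinct offsets is 33 x 33, so 2048 is a round figure):
print(32 * 32 * 2)  # 2048
```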

I do not understand why the input images are 224x224x3.
The passage is from Section 4.1 of "ImageNet Classification with Deep Convolutional Neural Networks" (the AlexNet paper).

My best guess is that this shape became the de facto "standard" for ImageNet-related models:
AlexNet used it, and many later models adopted the same shape.

These models were originally written under the assumption that the input shape had to be constant, which would also explain settling on one specific shape that "just works".
Today you would most likely find model implementations with adaptive pooling layers, which allow different input shapes.
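To illustrate why adaptive pooling removes the fixed-shape constraint, here is a minimal NumPy sketch of adaptive average pooling (the function name and binning are my own, modeled on the usual behavior of such layers): it divides any input feature map into a fixed output grid, so feature maps from different input sizes all reduce to the same shape before the fully connected layers.

```python
import numpy as np

def adaptive_avg_pool2d(x, output_size):
    """Average-pool an (H, W) feature map down to output_size = (oh, ow),
    whatever H and W are - this is what makes variable inputs possible."""
    h, w = x.shape
    oh, ow = output_size
    out = np.empty((oh, ow))
    for i in range(oh):
        # bin i covers rows floor(i*h/oh) .. ceil((i+1)*h/oh)
        r0, r1 = (i * h) // oh, -(-((i + 1) * h) // oh)
        for j in range(ow):
            c0, c1 = (j * w) // ow, -(-((j + 1) * w) // ow)
            out[i, j] = x[r0:r1, c0:c1].mean()
    return out

# Feature maps from two different input resolutions...
small = np.random.rand(6, 6)
large = np.random.rand(13, 17)
# ...both come out with the same fixed shape:
print(adaptive_avg_pool2d(small, (3, 3)).shape)  # (3, 3)
print(adaptive_avg_pool2d(large, (3, 3)).shape)  # (3, 3)
```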
