However, this doesn’t tell us what the input shape is. In mnist, the shape is (1, 1, 28, 28), but how do we know the input shape from the model definition (let’s say we don’t know the model is for mnist)? I couldn’t find any info about it.
This only shows us the Modules in the model and not the forward()
function that “glues” the Modules together and that may well perform
other non-trivial processing. Let’s analyze this making the simplest
assumptions about any missing processing.
fc1 has in_features = 9216. There must be some sort of flatten()
or reshape just before this. Convolutions work on images of
arbitrary sizes. conv2 has out_channels = 64, so
its output has shape [batch_size, 64, H, W]. Assuming
(correctly) that the batch size just goes along for the ride, after the flatten() operation, we have 64 * H * W “features” to pass into fc1.
Therefore H * W must be 9216 / 64 = 144. Guessing that the
image is square, it would have shape [64, 12, 12]. (It doesn’t
have to be square; it could have a shape of, say, [64, 9, 16].)
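To make the arithmetic above concrete, here is a minimal sketch in plain Python (no PyTorch needed) that lists every spatial size [H, W] compatible with fc1. The helper name spatial_candidates is made up for illustration, and the sketch assumes that nothing but a flatten() sits between conv2 and fc1.

```python
# Values read off the printed model: fc1.in_features and conv2.out_channels.
FC1_IN_FEATURES = 9216
CONV2_OUT_CHANNELS = 64


def spatial_candidates(in_features, channels):
    """Return every (H, W) pair with channels * H * W == in_features.

    Hypothetical helper: assumes only a flatten() between conv2 and fc1.
    """
    if in_features % channels != 0:
        return []  # no spatial size can satisfy the constraint
    area = in_features // channels  # H * W
    return [(h, area // h) for h in range(1, area + 1) if area % h == 0]


candidates = spatial_candidates(FC1_IN_FEATURES, CONV2_OUT_CHANNELS)
print(candidates)
```

Both the square guess (12, 12) and the non-square (9, 16) appear in the list, so the constraint on fc1 alone doesn't pin down a single shape.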
Each Conv2d layer (no padding, kernel_size = 3, stride = 1)
trims two rows and two columns of pixels off of the image (one pixel from each edge). Therefore the input to conv1 must have shape [batch_size, 1, 16, 16].
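The backward step can be sketched the same way. The forward size formula below is the one given in the nn.Conv2d documentation. The inverse helper is hypothetical and only covers the case assumed here (stride = 1, no padding, no dilation), and it keeps the guess that the feature map is square.

```python
def conv2d_out_size(size, kernel_size=3, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a Conv2d layer."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def unpadded_conv_in_size(out_size, kernel_size=3):
    """Invert the formula for stride=1, padding=0, dilation=1 (hypothetical helper)."""
    return out_size + (kernel_size - 1)


# Work backwards from the 12 x 12 map that fc1 requires.
size_after_conv2 = 12
size_after_conv1 = unpadded_conv_in_size(size_after_conv2)  # input to conv2
input_size = unpadded_conv_in_size(size_after_conv1)        # input to conv1
print(size_after_conv1, input_size)

# Sanity check: running forward again should land back on 12.
forward = conv2d_out_size(conv2d_out_size(input_size))
print(forward)
```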
In general, there will be multiple places in a model where the shape
of the tensor is constrained. (In this case, the input to fc1 has to
have in_features = 9216.) Then you work backwards from the
constraint to see what input shapes would be valid for your model.
Based on our assumptions, [1, 1, 28, 28] wouldn’t be valid for this
model. After the first two Conv2d layers, that shape would become [1, 64, 24, 24]. You could hypothetically have a factor-of-two
downsampling layer between conv2 and fc1 to take you down to [1, 64, 12, 12], but you would typically perform downsampling
between convolutional layers, rather than after them.
(You could also have an AdaptiveAvgPool2d (12, 12) between conv2 and fc1, which would be a pretty common way to make the
model much more flexible with respect to input shape.)
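As a rough check of the three scenarios above, this plain-Python sketch counts the features that would reach fc1. The function features_into_fc1 and its parameters are invented for illustration. It assumes two unpadded 3x3, stride-1 convolutions, an optional downsampling step that floor-divides each side (as a 2x2 pool with stride 2 would), and an optional adaptive pool that fixes the output size.

```python
CHANNELS = 64           # conv2.out_channels
FC1_IN_FEATURES = 9216  # fc1.in_features


def features_into_fc1(height, width, downsample=1, adaptive_size=None):
    """Number of features after conv1, conv2, optional pooling, and flatten().

    Hypothetical helper; the layers between conv2 and fc1 are assumptions.
    """
    # Two unpadded 3x3 convolutions remove 2 pixels each from every dimension.
    height, width = height - 4, width - 4
    if adaptive_size is not None:
        # An adaptive pool produces the requested size whatever the input size.
        height, width = adaptive_size
    else:
        # E.g. a 2x2 pool with stride 2 halves each side (rounding down).
        height, width = height // downsample, width // downsample
    return CHANNELS * height * width


print(features_into_fc1(16, 16))                # 16 x 16 input, no pooling
print(features_into_fc1(28, 28))                # 28 x 28 input, no pooling
print(features_into_fc1(28, 28, downsample=2))  # 28 x 28 input, 2x downsampling
print(features_into_fc1(40, 33, adaptive_size=(12, 12)))  # arbitrary input size
```

Only the calls that return 9216 describe a model whose fc1 would accept the flattened tensor.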