CNN input image size formula

Hi, I’ve read and searched and read some more on the forum, but I can’t understand the following:
how do I calculate and set the network’s input size, and what is its relation to image size?

I have an AlexNet clone (single channel, 224 x 224) which I now want to use with a single-channel 48 x 48 greyscale image:

import torch.nn as nn

class alexnet_custom(nn.Module):

    def __init__(self, num_classes=2):
        super(alexnet_custom, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        # flatten to (batch, 256 * 6 * 6); this hard-coded size is what ties
        # the classifier to a 224 x 224 input
        x = x.view(x.size(0), 256 * 6 * 6)
        return self.classifier(x)

Is there some clear and concise way to understand the relation between the tensor input and the image size, and how that is related to channels, height, width (and batch size)?


tensor input and the image size
single channel 224 x 224

224 x 224 is the input size of the image.
So if you want your image to be 48 x 48, your input tensor should be (batchsize, channels, 48, 48).
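For example, something like this (just a dummy batch; the batch size of 8 is an assumption):

import torch

# (batchsize, channels, height, width): 8 single-channel 48 x 48 images
x = torch.randn(8, 1, 48, 48)
print(x.shape)  # torch.Size([8, 1, 48, 48])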

@dhecloud how does that translate to the CNN input in

self.features = nn.Sequential(...

That is what is not clear, no matter how much googling I do :frowning:

The layers are specific and unique to each architecture. You can google their papers if you want to know more.

Your input (batchsize, 1, 224, 224) is fed through the layers in sequence, i.e. nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2), then nn.ReLU(inplace=True), and so on.

If you are modifying the original architecture, then you also have to make sure the resulting matrix operations are legal.
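One quick way to check that, just as a sketch (using the alexnet_custom class from the first post and a dummy input), is to feed a tensor through the feature layers one at a time and print the shape after each:

import torch

model = alexnet_custom()
x = torch.randn(1, 1, 224, 224)  # dummy single-channel 224 x 224 image
for layer in model.features:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))
# the last printed shape is what gets flattened into the first Linear layer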

So from what I understood, the Conv layers do not specify the image input size?
How do I calculate the matching dimensions of nn.Conv2d, especially the first one, knowing I have a 1-channel 48 x 48 image? Or is that not relevant as a question?

So from what I understood, the Conv layers do not specify the image input size?

Yes, they do not specify the input size. The input channels of the first conv layer have to match the number of channels in the image, though. But the layers do affect the output tensor shape.

How do I calculate the matching dimensions of nn.Conv2d, especially the first one, knowing I have a 1-channel 48 x 48 image? Or is that not relevant as a question?

I’m not sure what you mean. Do you mean how you should choose the parameters of the filters? There’s a bit of math required for this; you should read up on cs231n. If you are asking how to calculate the output tensor shape, the formula for that is in the Conv2d documentation.
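For reference, here is a small helper (a sketch of the formula from the Conv2d docs; MaxPool2d uses the same one, and the function name conv_out_size is just made up here) that computes the output height or width of a single layer:

def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    # output size formula from the Conv2d documentation
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# first AlexNet conv layer with a 224 x 224 input:
print(conv_out_size(224, kernel_size=11, stride=4, padding=2))  # 55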


Ok, I got it. I was for some reason under the impression that the first Conv layer would have a relation to the input size.
You can disregard my question, many thanks for the help.

If you change the input size, the size of the last convolutional layer's output, which is reshaped (flattened) to connect to the first fully connected layer, changes. In that case the input size of the first fully connected layer should change as well. For example, for AlexNet taking in 224 x 224 you get 6 x 6 in the last conv layer; for 48 x 48 you might expect this to be 2 x 2 (guessing; the best way would be to pass the image through the features part and see the size of the output).
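For example, a quick sanity check along those lines (a sketch, using the alexnet_custom class from the first post; swap in your real input size):

import torch

model = alexnet_custom()

x = torch.randn(1, 1, 224, 224)  # dummy input at the original size
out = model.features(x)
print(out.shape)  # torch.Size([1, 256, 6, 6]) -> 256 * 6 * 6 in the first Linear layer

# repeat with torch.randn(1, 1, 48, 48) to get the number you need for the
# first nn.Linear; if a pooling layer complains that the output size is too
# small, the input is too small for this particular stack of layers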


A network whose first layer is Conv2d will take an input of shape (batchsize, n_channels, height, width).

Since convolutional layers in PyTorch are dynamic by design, there is no straightforward way to query the intended/expected height and width; in fact, any image size may be acceptable to a module composed entirely of convolutions (subject to remaining a valid size after unpadded convolutions, poolings, etc.).

If the network subsequently contains, e.g., a Linear layer with a fixed input size parameter, only images whose final feature maps flatten to exactly that many elements are acceptable input to the network; roughly, a size like (height/n, n*width) should still work (subject to the same conditions as above, where height and width are the intended dimensions of the input).
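As an illustration (again just a sketch, using the alexnet_custom class from the first post): the convolutional part accepts several input sizes, while the full network is constrained by the hard-coded nn.Linear(256 * 6 * 6, 4096):

import torch

model = alexnet_custom()

# the purely convolutional part accepts any size that stays valid through
# every stride/pooling stage, and simply produces differently sized outputs
for s in (224, 112, 96):
    print(s, tuple(model.features(torch.randn(1, 1, s, s)).shape))

# the full model, however, only works for input sizes whose feature maps
# flatten to exactly 256 * 6 * 6 elements; anything else hits a shape
# mismatch at the first Linear layer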
