Input size of fc layer in tutorial?

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool  = nn.MaxPool2d(2,2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

In the PyTorch tutorial we have the above network model, but I was wondering about the input size of the first fully connected layer - 16 * 5 * 5.
First I think the 16 refers to the output channels of the last conv layer, yet I am not convinced that
x = x.view(-1, 16*5*5)
actually flattens the tensor by channel. So is my understanding correct?

Also I don't know why there is the 5 * 5: is it referring to the kernel size in the conv layer above? I think the tensor we would like to send to the fc layer is the image of 16 channels, so does 5 * 5 have any specific meaning here?


I expect the output of the conv layers to have 16 channels, and width = height = 5.
In which case, x.view(-1, 16*5*5) flattens the channels, height and width.
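To check this, here is a quick shape trace with a dummy input, assuming the standard 32×32 input the tutorial uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 6, 5)
pool = nn.MaxPool2d(2, 2)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.randn(1, 3, 32, 32)      # one 3-channel 32x32 image
x = pool(F.relu(conv1(x)))         # -> (1, 6, 14, 14)
x = pool(F.relu(conv2(x)))         # -> (1, 16, 5, 5)
print(x.shape)                     # torch.Size([1, 16, 5, 5])

flat = x.view(-1, 16 * 5 * 5)      # channels, height and width flattened together
print(flat.shape)                  # torch.Size([1, 400])
```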

How come we know the output width and height of a conv layer? I think we can only know the output channels of a conv layer - the width and height should depend on the original image size, right?

This is obviously one of those CNN models that requires the input images to be resized properly before being input into the model.

Yes, so the width and height change along with the input image size.

Typically, we could define a function to calculate the size automatically:

def num_flat_features(self, x):  # defined as a method inside Net
    size = x.size()[1:]  # all dimensions except the batch dimension
    num_features = 1
    for s in size:
        num_features *= s
    return num_features

then we can use

x = x.view(-1, self.num_flat_features(x))
instead of
x.view(-1, 16*5*5)
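For reference, the helper simply multiplies all non-batch dimensions together; here is the same logic as a standalone sketch run on a dummy tensor:

```python
import torch

def num_flat_features(x):
    size = x.size()[1:]  # all dimensions except the batch dimension
    num_features = 1
    for s in size:
        num_features *= s
    return num_features

x = torch.randn(1, 16, 5, 5)   # shape after the second conv + pool
print(num_flat_features(x))    # 400
```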

But then your linear layer would be the wrong size.

What about the 120 and 84 for the next layers? How are they selected?

One thing I want to know is: in my understanding we are flattening the features after the last convolution, so the fc layer takes 16*5*5 = 400 inputs.
If that's correct, why don't we just write the fc layer to take 400 inputs directly?

The problem is that the model you build is not dynamic with respect to the input image size: you have to calculate the flattened size to connect the conv layers with the fully connected layer, because you will never know it until you see your input image x.

I forget how TensorFlow deals with this, but obviously this is not very practical if you want to generalize the model to any image size.
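One common workaround (not part of the tutorial, just a sketch of a hypothetical variant) is to insert an nn.AdaptiveAvgPool2d before the flatten, so the spatial size reaching fc1 is fixed no matter what size image comes in:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveNet(nn.Module):
    """Hypothetical variant of the tutorial Net that accepts varying input sizes."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.adapt = nn.AdaptiveAvgPool2d((5, 5))  # forces a 5x5 spatial output
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.adapt(x)              # (N, 16, 5, 5) regardless of input H, W
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

net = AdaptiveNet()
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```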


The input of a PyTorch neural network has shape [BATCH_SIZE, CHANNEL_NUMBER, HEIGHT, WIDTH].

Example: let's assume your image has dimensions 1×3×32×32, meaning you have 1 image with 3 channels (RGB), with height 32 and width 32. The output size of a convolution is given by the formula ((W - F + 2P) / S) + 1
for the width and ((H - F + 2P) / S) + 1 for the height, where F is the kernel size, P the padding and S the stride.
NOTE: you only need to define nn.MaxPool2d(2, 2) once and reuse it after each conv layer, as the code above already does.
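The formula above can be wrapped in a small helper to trace the sizes through the network (a sketch; `conv_out` is a hypothetical name, with F the kernel size, P the padding, S the stride):

```python
def conv_out(w, f, p=0, s=1):
    """Output width/height of a conv (or pooling) layer: ((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

w = conv_out(32, 5)        # conv1, 5x5 kernel, no padding: 28
w = conv_out(w, 2, s=2)    # 2x2 max pool, stride 2: 14
w = conv_out(w, 5)         # conv2, 5x5 kernel: 10
w = conv_out(w, 2, s=2)    # 2x2 max pool, stride 2: 5
print(w)                   # 5
```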


With our input 1×3×32×32, after applying conv1 W will be 28 and H will be 28 ((32 - 5 + 0)/1 + 1 = 28), and applying (2,2) pooling halves the WIDTH and HEIGHT, and we have 6 feature maps. So after the first conv2d and pooling we end up with a tensor of dimension
1 × 6 × 14 × 14. Similarly, after the second conv2d and pooling we end up with a tensor of dimension 1 × 16 × 5 × 5. Finally, since we need a flat vector for the first fc layer, we unroll the tensor, which gives 16×5×5 = 400 features.

NOTE : Refer to this post which is a similar question
[Linear layer input neurons number calculation after conv2d]