Problem about nn.Linear(16 * 5 * 5, 120)

I am new to PyTorch and am confused about where the 16×5×5 in the code below comes from. Can anyone tell me why we use 16×5×5 here? Thank you :)

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # <-- the line in question
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
input = torch.randn(1, 1, 32, 32)
out = net(input)

The number of input features of your linear layer is determined by the shape of the activation coming from the previous layer.
In your case the activation has the shape [batch_size, channels=16, height=5, width=5]. To pass this activation to nn.Linear, you flatten the tensor to [batch_size, 16*5*5].
The 16 is the number of out_channels (i.e. the number of filter kernels) of the previous conv layer, while 5x5 is the spatial size that results from the conv and pooling operations applied to your input data.
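One way to check these shapes yourself is to print the activation shape after each step. This is only a minimal sketch: it reuses the two conv layers from the question, and the variable names (shape_after_block1 and so on) are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same conv layers as in the question's Net
conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.randn(1, 1, 32, 32)  # [batch_size, channels, height, width]

# conv1 (5x5 kernel, no padding) followed by 2x2 max pooling
x = F.max_pool2d(F.relu(conv1(x)), 2)
shape_after_block1 = tuple(x.shape)

# conv2 (5x5 kernel, no padding) followed by 2x2 max pooling
x = F.max_pool2d(F.relu(conv2(x)), 2)
shape_after_block2 = tuple(x.shape)

# Keep the batch dimension and flatten everything else
flat_shape = tuple(torch.flatten(x, 1).shape)

print(shape_after_block1)  # (1, 6, 14, 14)
print(shape_after_block2)  # (1, 16, 5, 5)
print(flat_shape)          # (1, 400), i.e. 16 * 5 * 5
```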

Let me know if you need more information!


Thank you! Does 5×5 mean the same thing as 28×28 in the MNIST example? If so, why is it 5×5 here?

Got it! Because the input is 1×32×32, after the conv and pooling operations, the size becomes 16×5×5. Thank you very much for your help!
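The arithmetic behind this can be sketched in plain Python. The helpers conv_out and pool_out are hypothetical names, not PyTorch functions; they assume dilation 1, no padding for the pooling, and the default floor rounding.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride=None):
    """Output size of max pooling without padding; stride defaults to the kernel size."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1


size = 32
size = conv_out(size, 5)  # conv1: 32 -> 28
size = pool_out(size, 2)  # pool:  28 -> 14
size = conv_out(size, 5)  # conv2: 14 -> 10
size = pool_out(size, 2)  # pool:  10 -> 5

in_features = 16 * size * size  # 16 channels from conv2
print(size, in_features)  # 5 400
```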

So is it out_channels * kernel_size[width] * kernel_size[height] from the previous layer?

Yes, the out_channels are defined by the previous layer. No, the kernel size of the previous layer isn’t used directly; what counts is the spatial size of the previous layer’s output activation.
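To illustrate the difference, here is a small sketch that pushes two input sizes through the same conv and pooling layers. The kernel size stays 5 in both cases, yet the spatial size of the output changes with the input. The 28×28 input is only an assumed example (the MNIST image size), and features is a hypothetical helper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same conv layers as in the question's Net
conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)


def features(x):
    """Apply the conv + pooling part of the network (hypothetical helper)."""
    x = F.max_pool2d(F.relu(conv1(x)), 2)
    x = F.max_pool2d(F.relu(conv2(x)), 2)
    return x


with torch.no_grad():
    shape_32 = tuple(features(torch.randn(1, 1, 32, 32)).shape)
    shape_28 = tuple(features(torch.randn(1, 1, 28, 28)).shape)

print(shape_32)  # (1, 16, 5, 5) -> in_features = 16 * 5 * 5 = 400
print(shape_28)  # (1, 16, 4, 4) -> in_features = 16 * 4 * 4 = 256
```

With 28×28 inputs, the first linear layer would therefore need to be nn.Linear(16 * 4 * 4, 120) instead.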
