Determine input dimension of image?


The dimensions of the images being trained on are 32x32x3 (height x width x channels).

I am unable to find where in this neural network class the height and width are specified, and I am confused.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3 input channels, 6 filters of size 5x5
        self.pool = nn.MaxPool2d(2, 2)         # 2x2 max pooling with stride 2
        self.conv2 = nn.Conv2d(6, 16, 5)       # 6 input channels, 16 filters of size 5x5
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # flattened conv output -> 120 units
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)             # flatten for the linear layers
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()


Since you are using convolutional filters, you don't have to specify the input height and width. However, as there are linear layers at the end of your classifier, you can actually retrieve the input size from the first linear layer's input size, which here is 16x5x5.

Here is the reasoning I use:

- 5x5 convolutional filters without padding and with stride=1 change the size from HxWxC (layer input size) to (H-4)x(W-4)xC' (layer output size).
- 2x2 max pooling layers divide both spatial dimensions by 2, so we go from HxWxC (layer input size) to (H/2)x(W/2)xC (layer output size).
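As a quick check of these two rules, here is a minimal sketch (assuming PyTorch is installed) applying a 5x5 conv and a 2x2 pool to a dummy 32x32 input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # NxCxHxW: one 32x32 RGB image
conv = nn.Conv2d(3, 6, 5)       # 5x5 filters, no padding, stride 1
pool = nn.MaxPool2d(2, 2)       # 2x2 pooling, stride 2

y = conv(x)   # height/width shrink by 4: 32 -> 28
print(y.shape)  # torch.Size([1, 6, 28, 28])
z = pool(y)   # height/width halved: 28 -> 14
print(z.shape)  # torch.Size([1, 6, 14, 14])
```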

Hence we just have to propagate this information backwards from the first linear layer's input size.

Since the last conv layer has 16 filters, you know its output has size 5x5x16, and inverting the transformations listed above goes as follows:

5x5 (after the second pool) -> 10x10 (output of conv2) -> 14x14 (after the first pool) -> 28x28 (output of conv1) -> 32x32 (the input).

There you go.
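This backwards propagation of sizes is just arithmetic, so it can be verified with a few lines of plain Python (no assumptions beyond the layer sizes above):

```python
# Walk backwards from the first linear layer's spatial size (5x5)
# to recover the input height/width.
size = 5
size *= 2   # undo the second MaxPool2d(2, 2):  5 -> 10
size += 4   # undo the second Conv2d(..., 5):  10 -> 14
size *= 2   # undo the first MaxPool2d(2, 2):  14 -> 28
size += 4   # undo the first Conv2d(..., 5):   28 -> 32
print(size)  # 32
```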

One last thing to notice is that I used HxWxC notation, while the PyTorch convention is CxHxW, but the reasoning remains unchanged.

To conclude, the height and width information is implicit in this code, and if you use a fully convolutional network, it is not even necessary.


@el_samou_samou Yes, you are correct.

Here is what I could deduce: the architecture has to be understood before programming or training in PyTorch.

Input image dimensions: HxWxC = n x n x c
Filter: f x f x c
Output dimension: floor((n + 2p - f) / s) + 1
where p = padding and s = stride.

Here p = 0 and s = 1 (the defaults). In this case:

The 32x32x3 input image convolved with six 5x5x3 filters gives an output of 28x28x6.
Max pooling uses the same equation for the output dimension: the 28x28x6 input pooled with a 2x2 window gives 14x14x6. IMPORTANT: in max pooling the stride equals the filter/kernel size (that is PyTorch's default, and it is set explicitly here with MaxPool2d(2, 2)).
The same applies to the other convolution and max pooling:
Convolution output = 10x10x16
Max pooling output = 5x5x16
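The whole walkthrough can be reproduced with the formula above in plain Python (the helper name `conv_out` is just for illustration):

```python
def conv_out(n, f, p=0, s=1):
    """Output spatial size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

n = 32
n = conv_out(n, f=5)        # conv1: 32 -> 28
n = conv_out(n, f=2, s=2)   # pool:  28 -> 14 (stride = kernel size)
n = conv_out(n, f=5)        # conv2: 14 -> 10
n = conv_out(n, f=2, s=2)   # pool:  10 -> 5
print(n)  # 5, which matches the 16 * 5 * 5 in fc1
```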

Hope it helps someone 🙂


It seems correct. Good luck.