Explanation of Conv2d


I was trying to follow this tutorial, but I’m not fully understanding the following section of the code. The code uses the MNIST dataset, with images of size 28 x 28 x 1 and 10 classes.

self.conv = torch.nn.Sequential()
self.conv.add_module("conv_1", torch.nn.Conv2d(1, 10, kernel_size=5))
self.conv.add_module("maxpool_1", torch.nn.MaxPool2d(kernel_size=2))
self.conv.add_module("relu_1", torch.nn.ReLU())
self.conv.add_module("conv_2", torch.nn.Conv2d(10, 20, kernel_size=5))
self.conv.add_module("dropout_2", torch.nn.Dropout())
self.conv.add_module("maxpool_2", torch.nn.MaxPool2d(kernel_size=2))
self.conv.add_module("relu_2", torch.nn.ReLU())

self.fc = torch.nn.Sequential()
self.fc.add_module("fc1", torch.nn.Linear(320, 50))
self.fc.add_module("relu_3", torch.nn.ReLU())
self.fc.add_module("dropout_3", torch.nn.Dropout())
self.fc.add_module("fc2", torch.nn.Linear(50, output_dim))

In the code, we have 2 convolution layers followed by a fully connected block. After each convolution layer there is one max pooling and one ReLU.

But my questions are the following:

  1. Why do we have 10 in the first convolution layer? Generally the output of the first conv layer is given by (W−F+2P)/S+1.
  2. In the second conv layer the output is 20. Again, why are we using that value?
  3. In the fully connected layer, we have 320 and 50 as input and output respectively. Where are these values coming from?

I’m completely new to PyTorch, so I’m having difficulty understanding simple things. The documentation is not always very clear.

If someone can answer these questions, it would be very helpful.

Thank you!

You seem to be confusing spatial output with channel output. The numbers 10 and 20 are channel counts. The spatial output size is calculated from the formula you mentioned (if you have a dilation factor, PyTorch takes that into account as well).
Sequentially, this is what is happening:

  • (28x28x1) -> conv1 -> (28-5+1) -> (24x24x10)
  • (24x24x10) ->max1 -> (24/2) -> (12x12x10)
  • (12x12x10) ->conv2 -> (12-5+1) -> (8x8x20)
  • (8x8x20) -> max2 -> (8/2) -> (4x4x20)
  • (4x4x20) = 320 -> Linear -> 50
  • 50 -> Linear -> output_dim
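The walk-through above can be checked directly in PyTorch by pushing a dummy MNIST-sized tensor through the same stack (a minimal stand-alone sketch of the conv block from the question):

```python
import torch

# Stand-alone sketch of the conv stack from the question.
# Dropout is kept for fidelity, but it does not change any shapes.
conv = torch.nn.Sequential(
    torch.nn.Conv2d(1, 10, kernel_size=5),   # 28x28x1  -> 24x24x10
    torch.nn.MaxPool2d(kernel_size=2),       # 24x24x10 -> 12x12x10
    torch.nn.ReLU(),
    torch.nn.Conv2d(10, 20, kernel_size=5),  # 12x12x10 -> 8x8x20
    torch.nn.Dropout(),
    torch.nn.MaxPool2d(kernel_size=2),       # 8x8x20   -> 4x4x20
    torch.nn.ReLU(),
)

x = torch.randn(1, 1, 28, 28)   # N x C x H x W, one MNIST-sized image
out = conv(x)
print(out.shape)                # torch.Size([1, 20, 4, 4])
print(out.flatten(1).shape)     # torch.Size([1, 320]) -> the input of Linear(320, 50)
```

This is exactly where the 320 in `fc1` comes from: 4 x 4 x 20 flattened into one vector.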

@ImgPrcSng: Thanks a lot for answering my question. I had almost lost hope of getting an answer to this simple question. But can you suggest some literature on this difference between “spatial output” and “channel output”? Also, how do I work with the spatial dimensions in PyTorch? I’m coming from the TensorFlow world, so I can relate things to that world much more comfortably.

Thank you!

PyTorch convolutions are no different from TensorFlow convolutions; it is just a notation difference.

Consider an image of size Ci x H x W, where Ci is the number of input channels and H and W are the input spatial dimensions (height and width).

Let’s say you want to convolve this image with a kernel of size K x K (let’s keep stride = 1 and padding = 0) to produce Co (output channels) feature maps.

Thus the size of the convolution kernel will be Co x Ci x K x K.
The operation produces an output of size Co x Ho x Wo, where Ho = (H - K + 1) and Wo = (W - K + 1).
Co refers to the number of feature maps (output channels), and Ho, Wo are the output spatial dimensions, calculated using the same formula you mentioned.

The choices we make when defining a conv layer are Co, K, stride, padding, and dilation. The remaining information is either calculated (Wo, Ho) or fixed by the input (Ci, H, W).
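As a quick sketch of this (with hypothetical values Ci = 3, Co = 16, K = 5 and a 32 x 32 input), you can inspect the kernel shape directly:

```python
import torch

# Hypothetical example: Ci = 3, Co = 16, K = 5, stride = 1, padding = 0
conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

print(conv.weight.shape)  # torch.Size([16, 3, 5, 5]) -> Co x Ci x K x K

x = torch.randn(1, 3, 32, 32)  # H = W = 32
print(conv(x).shape)           # torch.Size([1, 16, 28, 28]), since Ho = Wo = 32 - 5 + 1
```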


Thank you Prasanna! I asked for some literature, but you gave me the complete answer.

Thanks a lot!

What is the input channel?

in_channels defines the number of input channels for this particular convolution.
By default each kernel uses all input channels of the incoming activation and performs the convolution over the spatial dimensions.
CS231n - CNN gives a good explanation of the general workflow of convolutions.
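A small sketch of that (with hypothetical channel counts): each of the out_channels filters holds one K x K slice per input channel, so a single filter already covers all incoming channels.

```python
import torch

# Hypothetical sizes: 3 input channels, 8 filters, 3x3 kernel
conv = torch.nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

print(conv.weight.shape)     # torch.Size([8, 3, 3, 3]) -> one filter per output channel
print(conv.weight[0].shape)  # torch.Size([3, 3, 3])    -> a single filter spans all 3 input channels
```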

Let me know, if something is unclear or if I misunderstood your question.

In nn.Sequential blocks like these (where the conv layers are defined in terms of input/output channels), how is the spatial input to such a block specified? Or is that implicit? TIA

Conv and pooling layers work on variable spatial input shapes as long as the input is larger than the kernel size.
Conv layers only need the definition of the input channels (determined by the input activation), the output channels (number of filters), and the kernel size.
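A sketch of this behaviour, feeding a few hypothetical spatial sizes through the same layer:

```python
import torch

conv = torch.nn.Conv2d(1, 10, kernel_size=5)  # channels and kernel fixed at definition time

for size in (28, 64, 100):                    # hypothetical spatial sizes
    out = conv(torch.randn(1, 1, size, size))
    print(out.shape)                          # spatial dims shrink by 4; channels are always 10
```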
Let me know, if I misunderstood your question.

Thanks, the explanation helps me understand why layer definitions don’t specify input shapes (thus enabling them to be applied to variable input shapes). Are the shapes just derived from the data that is passed, rather than being passed as an explicit parameter?

Yes, the spatial output size depends on the input size, the kernel size, padding, stride, and dilation. This paper gives more details about the conv arithmetic, and the docs for nn.Conv2d give you the applied formula.
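That formula can be written out in plain Python (this mirrors the output-size formula in the nn.Conv2d docs; the helper name is my own):

```python
def conv2d_out_size(size, kernel, stride=1, padding=0, dilation=1):
    # Output spatial size of nn.Conv2d (and nn.MaxPool2d) along one dimension
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv2d_out_size(28, 5))            # 24: the first conv in the example above
print(conv2d_out_size(24, 2, stride=2))  # 12: the first max pool
print(conv2d_out_size(12, 5))            # 8
print(conv2d_out_size(8, 2, stride=2))   # 4
```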