I was trying to follow this tutorial, but I’m not fully understanding the following section of the code. The code uses the MNIST dataset, with images of size 28 x 28 x 1 and 10 classes.
You seem to be confusing spatial output with channel output. The numbers 10 and 20 represent channels. The spatial output size is calculated from the formula you mentioned (if you have a dilation factor, that is also taken into account in PyTorch).
Sequentially, this is what is happening:
@ImgPrcSng: Thanks a lot for answering my question. I had almost lost hope of getting an answer to this simple question. But can you suggest some literature on this difference between “spatial output vs channel output”? Also, how do I implement the spatial approach in PyTorch? I’m coming from the TensorFlow world, so I can relate things to that world much more comfortably.
PyTorch convolutions are no different from TensorFlow convolutions; it is just a difference in notation.
Consider an image Ci x H x W, where Ci is the number of input channels and H and W are the input spatial dimensions (height and width).
Let’s say you want to convolve this image with a kernel of size K x K (let’s keep stride = 1 and padding = 0) to produce Co feature maps (output channels).
Thus the size of the convolution kernel will be Co x Ci x K x K.
The operation produces an output of size Co x Ho x Wo, where Ho = (H - K + 1) and Wo = (W - K + 1).
Co refers to the number of feature maps (output channels), and Ho, Wo refer to the output spatial dimensions, calculated using that formula.
The choices we make when defining a conv layer are Co, K, stride, padding, and dilation. The remaining information is either calculated (Ho, Wo) or fixed by the input (Ci, H, W).
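A quick sketch of that arithmetic for an MNIST-sized input (the choice of Co = 10 and K = 5 here is arbitrary, just to make the numbers concrete):

```python
import torch
import torch.nn as nn

# MNIST-sized input: batch of 1, Ci = 1 channel, H = W = 28
x = torch.randn(1, 1, 28, 28)

# Co = 10 output channels, K = 5 kernel, stride = 1, padding = 0
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5)

out = conv(x)
# Ho = H - K + 1 = 28 - 5 + 1 = 24, and likewise Wo
print(out.shape)  # torch.Size([1, 10, 24, 24])
```

Note that only Ci, Co, and K were specified; Ho and Wo fall out of the computation.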
in_channels defines the number of input channels for this particular convolution.
By default, each kernel uses all input channels of the incoming activation and performs the convolution over the spatial dimensions. CS231n - CNN gives a good explanation of the general workflow of convolutions.
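You can verify the Co x Ci x K x K kernel layout directly by inspecting the layer’s weight tensor (the numbers 16 and 3 below are just illustrative):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# Each of the 16 kernels spans all 3 input channels:
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) -> Co x Ci x K x K
print(conv.bias.shape)    # torch.Size([16]) -> one bias per output channel
```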
Let me know, if something is unclear or if I misunderstood your question.
In nn.Sequential blocks like these (where the conv layers are defined in terms of input/output channels), how is the spatial input to such a block specified? Or is it implicit? TIA
Conv and pooling layers work on variable spatial input shapes, as long as the input is larger than the kernel size.
Conv layers only need the number of input channels (which must match the incoming activation), the number of output channels (i.e. the number of filters), and the kernel size.
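A small sketch of that point: the same conv layer accepts any spatial size, as long as the channel count matches (the sizes 28, 64, and 101 below are arbitrary examples):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

# The same layer handles any spatial size >= the kernel size:
for hw in (28, 64, 101):
    x = torch.randn(1, 3, hw, hw)
    # Spatial size is preserved here because padding=1 and K=3
    print(conv(x).shape)
```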
Let me know, if I misunderstood your question.
Thanks, the explanation helps me understand why layer definitions don’t specify input shapes (thus enabling them to be applied to variable input shapes). Are the shapes just derived from the data that is passed, rather than being specified as an explicit parameter?
Yes, the spatial output size depends on the input size, the kernel size, the padding, the stride, and the dilation. This paper gives more details about convolution arithmetic, and the docs for nn.Conv2d give the applied formula.
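The formula from the nn.Conv2d docs can be checked against an actual forward pass; the parameter values below are arbitrary, just to exercise stride, padding, and dilation together:

```python
import math
import torch
import torch.nn as nn

def conv_out_size(h, k, stride=1, padding=0, dilation=1):
    # Spatial output size formula from the nn.Conv2d docs
    return math.floor((h + 2 * padding - dilation * (k - 1) - 1) / stride + 1)

h, k, s, p, d = 28, 3, 2, 1, 2
conv = nn.Conv2d(1, 4, kernel_size=k, stride=s, padding=p, dilation=d)
out = conv(torch.randn(1, 1, h, h))

# The layer's actual spatial output matches the formula:
print(out.shape[2], conv_out_size(h, k, stride=s, padding=p, dilation=d))
```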