# Algorithm of how Conv2d is implemented in PyTorch

I am working on inference for a PyTorch ONNX model, which is why I am asking this question.

Assume I have an image with dimensions `32 x 32 x 3` (CIFAR-10 dataset). I pass it through a Conv2d layer whose weight has dimensions `3 x 192 x 5 x 5`. The command I used is: `Conv2d(3, 192, kernel_size=5, stride=1, padding=2)`

Using the formula (stated here for reference, pg. 12 of https://arxiv.org/pdf/1603.07285.pdf), I should be getting an output image with dimensions `28 x 28 x 192` (`input - kernel + 1 = 32 - 5 + 1`).

My question is: how does PyTorch use this 4D tensor `3 x 192 x 5 x 5` to give me an output of `28 x 28 x 192`? The layer's weight is a 4D tensor, while the input image is a 2D one.

How is the kernel (`5x5`) spread over the image matrix `32 x 32 x 3`? What does the kernel convolve with first: `3 x 192` or `32 x 32`?

Note: I have understood the 2D aspects of things. I am asking the above questions about 3 dimensions or more.


Since you are using a padding of 2 for a kernel size of 5, the spatial output size should stay constant (i.e. `32 x 32`, not `28 x 28`).
The input is expected to be a 4-dimensional tensor in the shape `[batch_size, channels, height, width]`, while the kernel has the shape `[out_channels, in_channels, height, width]`.

In the default setup (not a depthwise or grouped convolution), each kernel is applied to patches of the input in a sliding-window manner, using all input channels.
Have a look at CS231n, which explains the underlying method pretty well.
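To make the sliding-window behavior concrete, here is a minimal NumPy sketch of a naive 2D convolution. The sizes are toy stand-ins for the CIFAR-10 case (fewer channels and a smaller image so the loops stay readable), not PyTorch's actual implementation, which uses much faster backends:

```python
import numpy as np

# Toy sizes standing in for the CIFAR-10 example:
# 3 input channels, 4 output channels (instead of 192), 5x5 kernel, padding 2.
in_c, out_c, k, pad = 3, 4, 5, 2
h = w = 8

x = np.random.randn(in_c, h, w)              # one input image [C_in, H, W]
weight = np.random.randn(out_c, in_c, k, k)  # kernel [C_out, C_in, kH, kW]

# Zero-pad the spatial dims so the output keeps the input's height/width.
xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))

out = np.zeros((out_c, h, w))
for oc in range(out_c):        # one filter per output channel
    for i in range(h):         # slide the window down the height...
        for j in range(w):     # ...and across the width
            patch = xp[:, i:i + k, j:j + k]   # patch spans ALL input channels
            out[oc, i, j] = np.sum(patch * weight[oc])

print(out.shape)  # (4, 8, 8): spatial size preserved, one map per filter
```

Note that each output pixel sums over all three input channels at once, which is why a single filter needs a `3 x 5 x 5` weight block.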


Thanks for the answer @ptrblck, but my question was more related to PyTorch's implementation.
When I checked the tensor size of the Conv2d layer's weight, it was `[3 x 192 x 5 x 5]`, and 3 x 192 x 5 x 5 = 14400 weights are stored in the pth.tar file (checkpoint file). How is this tensor arranged?
I understand 192 is the depth and 5 x 5 is the kernel size. Are there 3 tensors of depth 192 with size 5 x 5 being returned from this layer? What does the 3 stand for in the pth.tar file, then?

As explained, the 3 stands for the number of input channels.
You basically have 192 kernels, each with 3 channels and a spatial size of `5x5`.
The link above explains the applied method.

PyTorch dispatches to different backends, as seen here, in case you are interested in the implementation details.
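A small sketch of the weight layout described above may help: PyTorch stores a Conv2d weight as `[out_channels, in_channels, kH, kW]`, so the 14400 checkpoint values are 192 filters of shape `3 x 5 x 5` each (NumPy used here only to illustrate the shapes and count):

```python
import numpy as np

# The Conv2d weight from this thread: 192 filters, each 3 x 5 x 5.
out_c, in_c, k = 192, 3, 5
weight = np.zeros((out_c, in_c, k, k))  # [C_out, C_in, kH, kW]

# Total parameter count matches the 14400 weights in the checkpoint
# (any bias terms are stored as a separate tensor of shape [192]).
print(weight.size)      # 14400
print(weight[0].shape)  # (3, 5, 5): one filter, one 5x5 slice per input channel
```

So the `3` is not three separate kernels of depth 192; it is the channel depth of each of the 192 filters.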
