# Algorithm of how Conv2d is implemented in PyTorch

I am working on inference for a PyTorch ONNX model, which is why I am asking this question.

Assume I have an image with dimensions `32 x 32 x 3` (CIFAR-10 dataset). I pass it through a Conv2d layer whose weight has dimensions `3 x 192 x 5 x 5`. The command I used is: `Conv2d(3, 192, kernel_size=5, stride=1, padding=2)`

Using the formula (stated here for reference, pg. 12 of https://arxiv.org/pdf/1603.07285.pdf), I should be getting an output image with dimensions `28 x 28 x 192` (`input - kernel + 1 = 32 - 5 + 1`).

My question is: how does PyTorch use this 4D tensor `3 x 192 x 5 x 5` to give me an output of `28 x 28 x 192`? The layer's weight is a 4D tensor, while the input image is a 2D one.

How is the kernel (`5x5`) spread over the image matrix `32 x 32 x 3`? What does the kernel convolve with first: `3 x 192` or `32 x 32`?

Note: I have understood the 2D aspects of things. I am asking the above questions about 3 dimensions or more.


Since you are using a padding of 2 for a kernel size of 5, the spatial output size should stay constant (i.e. `32 x 32`, not `28 x 28`).
The input is expected to be a 4-dimensional tensor in the shape `[batch_size, channels, height, width]`, while the kernel has the shape `[out_channels, in_channels, height, width]`.

In the default setup (not a depthwise or grouped convolution), each kernel is applied to patches of the input in a sliding-window manner, using all input channels.
Have a look at CS231n, which explains the underlying method pretty well.
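To make the sliding-window behavior concrete, here is a minimal NumPy sketch of a naive 2D convolution. The sizes are toy stand-ins for the CIFAR-10 case (fewer channels and a smaller image so the loops stay readable), not PyTorch's actual implementation, which uses much faster backends:

```python
import numpy as np

# Toy sizes standing in for the CIFAR-10 example:
# 3 input channels, 4 output channels (instead of 192), 5x5 kernel, padding 2.
in_c, out_c, k, pad = 3, 4, 5, 2
h = w = 8

x = np.random.randn(in_c, h, w)              # one input image [C_in, H, W]
weight = np.random.randn(out_c, in_c, k, k)  # kernel [C_out, C_in, kH, kW]

# Zero-pad the spatial dims so the output keeps the input's height/width.
xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))

out = np.zeros((out_c, h, w))
for oc in range(out_c):        # one filter per output channel
    for i in range(h):         # slide the window down the height...
        for j in range(w):     # ...and across the width
            patch = xp[:, i:i + k, j:j + k]   # patch spans ALL input channels
            out[oc, i, j] = np.sum(patch * weight[oc])

print(out.shape)  # (4, 8, 8): spatial size preserved, one map per filter
```

Note that each output pixel sums over all three input channels at once, which is why a single filter needs a `3 x 5 x 5` weight block.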


Thanks for the answer @ptrblck, but my question was more related to PyTorch's implementation.
When I checked the tensor size of the Conv2d layer's weight, it was `[3 x 192 x 5 x 5]`, and 3 x 192 x 5 x 5 = 14400 weights are stored in the pth.tar file (checkpoint file). How is this tensor arranged?
I understand 192 is the depth and 5 x 5 is the kernel size. Are there 3 tensors of depth 192 with size 5 x 5 being returned from this layer? What does the 3 stand for in the pth.tar file, then?

As explained, the 3 stands for the number of input channels.
You basically have 192 kernels, each with 3 channels and a spatial size of `5x5`.
The link above explains the applied method.

PyTorch dispatches to different backends, as seen here, in case you are interested in the implementation details.
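A small sketch of the weight layout described above may help: PyTorch stores a Conv2d weight as `[out_channels, in_channels, kH, kW]`, so the 14400 checkpoint values are 192 filters of shape `3 x 5 x 5` each (NumPy used here only to illustrate the shapes and count):

```python
import numpy as np

# The Conv2d weight from this thread: 192 filters, each 3 x 5 x 5.
out_c, in_c, k = 192, 3, 5
weight = np.zeros((out_c, in_c, k, k))  # [C_out, C_in, kH, kW]

# Total parameter count matches the 14400 weights in the checkpoint
# (any bias terms are stored as a separate tensor of shape [192]).
print(weight.size)      # 14400
print(weight[0].shape)  # (3, 5, 5): one filter, one 5x5 slice per input channel
```

So the `3` is not three separate kernels of depth 192; it is the channel depth of each of the 192 filters.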
