Conv2d layer dimension order using shape

It seems that the order of the input and output channels in the function definition is reversed compared to the order in the weight tensor's shape. That is, we have

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')

If in_channels=16 and out_channels=32 (with kernel_size=3) and we check the shape of the layer's weight tensor,

we get (32, 16, 3, 3).

Is that correct?

A simple experiment shows that the weight shape is (32, 16, 3, 3), and this is correct:

import torch
conv = torch.nn.Conv2d(16, 32, 3)  # in_channels=16, out_channels=32, kernel_size=3
print(conv.weight.shape)
# torch.Size([32, 16, 3, 3])

Think about it this way. We have a convolution window (kernel) of shape 3x3, which gives the (3, 3) part. We need one such window per input channel, so in this case we need 16 kernels, which gives (16, 3, 3). All of these 16 x 3 x 3 weights are needed to produce just one output channel: (1, 16, 3, 3). But you want 32 output channels, so you need a weight tensor of shape (32, 16, 3, 3) to compute them all.
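To make the indexing concrete, here is a minimal pure-Python sketch of the computation described above, assuming stride 1, no padding, no bias, and groups=1. The function name conv2d_naive and the tiny all-ones inputs are made up for illustration; this is not how PyTorch implements it internally, only the same arithmetic written with explicit loops. Note how the weight is indexed as [output channel][input channel][kernel row][kernel column].

```python
def conv2d_naive(image, weight):
    """Hypothetical helper for illustration only.

    image:  nested list indexed [in_ch][H][W]
    weight: nested list indexed [out_ch][in_ch][kH][kW]
    Assumes stride 1, no padding, no bias, groups=1.
    """
    out_ch, in_ch = len(weight), len(weight[0])
    k_h, k_w = len(weight[0][0]), len(weight[0][0][0])
    h_out = len(image[0]) - k_h + 1
    w_out = len(image[0][0]) - k_w + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(out_ch)]
    for o in range(out_ch):          # one full (in_ch, kH, kW) block per output channel
        for i in range(in_ch):       # one kH x kW window per input channel
            for row in range(h_out):
                for col in range(w_out):
                    for ky in range(k_h):
                        for kx in range(k_w):
                            out[o][row][col] += (
                                weight[o][i][ky][kx] * image[i][row + ky][col + kx]
                            )
    return out


# Tiny example: 2 input channels, 3 output channels, 3x3 kernel, 4x4 image, all ones.
in_ch, out_ch, k = 2, 3, 3
image = [[[1.0] * 4 for _ in range(4)] for _ in range(in_ch)]
weight = [[[[1.0] * k for _ in range(k)] for _ in range(in_ch)] for _ in range(out_ch)]

result = conv2d_naive(image, weight)
print(len(result), len(result[0]), len(result[0][0]))  # 3 2 2
print(result[0][0][0])  # 18.0, i.e. in_ch * k * k weights contribute to each output value
```

The outermost loop runs over output channels, which is why that dimension comes first in the weight shape.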

See example: gif
To create one new output channel from three input channels (RGB), you need to store weights of shape (1, 3, 3, 3). To get two output channels, you need to double the number of weights: (2, 3, 3, 3).
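The RGB case can be checked directly in PyTorch. This is a small sketch, assuming the default groups=1; the variable names one_out and two_out are made up for illustration.

```python
import torch

# Three input channels (RGB), 3x3 kernel, one vs. two output channels.
one_out = torch.nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
two_out = torch.nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)

print(tuple(one_out.weight.shape))  # (1, 3, 3, 3)
print(tuple(two_out.weight.shape))  # (2, 3, 3, 3)

# Doubling the output channels doubles the number of weights.
print(one_out.weight.numel(), two_out.weight.numel())  # 27 54
```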