Why does the weight in class _ConvNd have this dimension?

I’ve been reading the PyTorch source code recently and cannot understand the dimensions of the weight of a convolution layer.

Link to the code I’m talking about from github.

The weight is defined as:

self.weight = Parameter(torch.Tensor(
                out_channels, in_channels // groups, *kernel_size))

However, if in_channels = 3, out_channels = 9, and groups = 1, then 27 weight matrices of size *kernel_size (27 filters) are created.
But I think there should be only 3 weight matrices of size *kernel_size (3 filters), shared among the 3 input channels, shouldn’t there?

Could anyone tell me which part of my thinking is wrong?

Each output channel is connected to all input channels (with groups=1), hence out_channels * in_channels filters exist.
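A minimal sketch of this, assuming a recent PyTorch install. The input size of 8x8 and the variable names are made up for illustration. It checks the weight shape for the values in the question and rebuilds one output channel by hand as a sum over input channels:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Values from the question: 3 input channels, 9 output channels, groups=1.
conv = torch.nn.Conv2d(in_channels=3, out_channels=9, kernel_size=3, bias=False)
weight_shape = tuple(conv.weight.shape)
print(weight_shape)  # (9, 3, 3, 3): 9 filters, each spanning all 3 input channels

with torch.no_grad():
    x = torch.randn(1, 3, 8, 8)  # arbitrary dummy input
    out = conv(x)

    # Rebuild output channel 0 by hand: convolve each input channel with its
    # own 2D slice of filter 0, then sum the three partial results.
    manual = sum(
        F.conv2d(x[:, c:c + 1], conv.weight[0:1, c:c + 1])
        for c in range(3)
    )
    matches = torch.allclose(out[:, 0:1], manual, atol=1e-5)

print(matches)
```

If the check passes, output channel 0 is indeed the sum of three distinct 2D filters, one per input channel.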

Do you mean that, in this implementation, there is no parameter sharing?
Each output channel is the sum of 3 distinct filters applied on each input channel?
That is, in my previous case, with 27 filters [f1, …, f27]:
f1 - f3 contribute to output_channel1.
f4 - f6 contribute to output_channel2.

f25 - f27 contribute to output_channel9.

Is it correct? Is the interaction between the 3 channels really a sum?


Another question: if groups is set to 3, then each output channel will have only 1 filter contributing to it, so there are 9 filters in total. Right?

Or should I think about it in this way?
Each filter is actually a cube of dimension (in_channel // groups, *kernel_size).

I explained a similar question in this thread.

COOL… That explained my problem with output_channel~ @prtblck

How about grouping?
Does it mean arranging the input channels into several groups, where each group has separate filters of dimension (input_channel // num_groups, kernel_size, kernel_size)? (Like in AlexNet, where there are 2 groups?)

Sorry, haven’t seen this post.
The grouping parameter lets you decide how the filters are connected between the input channels and output channels.
E.g. in a vanilla convolution, each kernel will convolve the input using all input channels.
I.e. for an input of dimension [batch, 10, 24, 24], each kernel (with kernel_size=3) will have a dimension of [10, 3, 3]. The weights in this conv layer will therefore have a dimension of [number_of_kernels, 10, 3, 3].

Using groups=2 for 20 kernels will yield a weight dimension of [20, 5, 3, 3].
The documentation explains:

At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.

That’s why each kernel will only see 5 input channels in my example.
Note that in_channels and out_channels both have to be divisible by groups.
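The quoted equivalence can be checked with a small sketch. The layer names `grouped`, `first`, and `second` are made up here, and the slicing assumes that the first half of the output channels belongs to the first group of input channels:

```python
import torch

torch.manual_seed(0)

# 10 input channels, 20 kernels, groups=2, as in the example above.
grouped = torch.nn.Conv2d(10, 20, kernel_size=3, groups=2, bias=False)
grouped_shape = tuple(grouped.weight.shape)
print(grouped_shape)  # (20, 5, 3, 3): each kernel sees only 5 input channels

# Two ordinary conv layers, each mapping 5 input channels to 10 output channels.
first = torch.nn.Conv2d(5, 10, kernel_size=3, bias=False)
second = torch.nn.Conv2d(5, 10, kernel_size=3, bias=False)

with torch.no_grad():
    # Reuse the grouped layer's weights so the results are comparable.
    first.weight.copy_(grouped.weight[:10])
    second.weight.copy_(grouped.weight[10:])

    x = torch.randn(1, 10, 24, 24)
    # Each layer sees half of the input channels; the outputs are concatenated.
    side_by_side = torch.cat([first(x[:, :5]), second(x[:, 5:])], dim=1)
    same = torch.allclose(grouped(x), side_by_side, atol=1e-5)

print(same)
```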

For groups=in_channels each input channel will have its own set of filters.
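As a rough illustration of the groups=in_channels case, the sketch below uses 3 input and 9 output channels (values borrowed from the question above, the rest is made up). It replaces input channel 0 with new values and records which output channels react:

```python
import torch

torch.manual_seed(0)

# groups == in_channels: every input channel gets its own set of filters.
conv = torch.nn.Conv2d(3, 9, kernel_size=3, groups=3, bias=False)
depthwise_shape = tuple(conv.weight.shape)
print(depthwise_shape)  # (9, 1, 3, 3): 9 filters, each seeing a single input channel

with torch.no_grad():
    x = torch.randn(1, 3, 8, 8)
    base = conv(x)

    # Replace input channel 0 only and see which output channels change.
    x_changed = x.clone()
    x_changed[:, 0] = torch.randn(8, 8)
    diff = (conv(x_changed) - base).abs().amax(dim=(0, 2, 3))

changed = (diff > 1e-6).tolist()
print(changed)
```

Only the first three entries should come out as True, since output channels 0 to 2 are the ones fed by input channel 0.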

Hope that still helps and sorry for the late reply :wink:


Thanks. Got the idea here.
The ideas in CNNs are really easier to describe with a visualization than with words!

Yeah, you are right.
Sometimes it helps to create dummy layers and study the shape :wink:
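For example, a quick sketch along those lines. The channel counts 12 and 24 are arbitrary, chosen only so that several group sizes divide both:

```python
import torch

in_channels, out_channels, kernel_size = 12, 24, 3

# Create dummy layers with different group settings and record the weight shapes.
shapes = {}
for groups in (1, 2, 3, 4, 12):
    layer = torch.nn.Conv2d(in_channels, out_channels, kernel_size, groups=groups)
    shapes[groups] = tuple(layer.weight.shape)
    print(groups, shapes[groups])
```

The second dimension of each shape should equal in_channels // groups, matching the definition of self.weight quoted at the top of the thread.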

@ezyang created an awesome visualization for convolutions. Check it out here.


Cool!
I also found one here.