Why does a filter in depthwise convolution use only one kernel for all channels instead of unique kernels for all channels?

Maybe I don’t understand something regarding depthwise separable convolutions, but if you set the argument groups=in_channels in nn.Conv2d, you get only one kernel per filter, no matter how many input channels there are. To my understanding, a normal convolution has as many unique kernels per filter as there are input channels. Doesn’t that make a depthwise convolutional filter lose some information compared to a normal Conv2d, which uses a different (unique) kernel for every input channel?

That’s not the case: a “standard” convolution uses filters that span all input channels by default. out_channels defines the number of filters. CS231n - Convolutional Layers describes this in more detail.

A depthwise convolution with groups=in_channels uses filters where each group is applied to one input channel only. Since the parameter count is reduced, a 1x1 “pointwise” convolution, applied in the “standard” way, is often added afterwards.
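To see why the parameter count drops, here is a quick back-of-the-envelope comparison. The channel sizes (32 in, 64 out, 3x3 kernels) are my own choice for illustration, not values from the thread:

```python
# Hypothetical sizes, chosen only for illustration
in_ch, out_ch, k = 32, 64, 3

# Standard conv: every filter spans all input channels
standard = out_ch * in_ch * k * k            # 18432 weights

# Depthwise conv (groups=in_ch): one single-channel filter per input channel,
# followed by a 1x1 "pointwise" conv that mixes the channels
depthwise = in_ch * 1 * k * k                # 288 weights
pointwise = out_ch * in_ch * 1 * 1           # 2048 weights

print(standard, depthwise + pointwise)       # 18432 2336
```

So for these sizes the depthwise + pointwise pair uses roughly 8x fewer weights than the standard convolution.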

Excuse me, that was my bad; my statement was wrong, and I updated my question as well. What I wanted to say in that sentence is:

Regarding this statement below:

As I understand it, if groups=in_channels, then one filter has only one unique kernel. For example, take a single 3x3 kernel filled with 1s. That same kernel would be used on all the input channels (e.g. channel R, channel G, channel B). In a normal convolution, one filter would have a unique kernel for each of the RGB channels: channel R would be convolved with a 3x3 kernel of all 1s, channel G with a 3x3 kernel of all 2s, and channel B with a 3x3 kernel of all 3s. All three kernels would be part of one filter, and every other filter would have its own 3 unique kernels.
What I don’t understand is why a depthwise separable convolution uses only one kernel per filter if groups=in_channels. Doesn’t using the same kernel for all channels make the model lose a little bit of information, or is that the whole point of depthwise separable convolution, aside from the fact that it drastically reduces the computational cost?

“Kernels” and “filters” are used interchangeably and refer to the same object. If you refer to a “kernel” as a 2D kernel that is stacked into a “filter”, then you are right, but note that this terminology is not commonly used in the deep learning field.

No, if groups=in_channels, each filter will be used for a separate input channel. The Guide to Convolution Arithmetic gives examples and visualizes it beautifully. Here is additionally a code snippet you could play around with to check how the kernels are applied:

import torch
import torch.nn as nn

# depthwise conv: 3 input channels, 3 output channels, one group per channel
conv = nn.Conv2d(3, 3, 3, 1, 1, bias=False, groups=3)
print(conv.weight.shape)

with torch.no_grad():
    conv.weight = nn.Parameter(
        torch.stack((
            torch.ones(1, 3, 3),
            torch.ones(1, 3, 3) * 2,
            torch.ones(1, 3, 3) * 3
        ))
    )

x = torch.cat((
    torch.ones(1, 3, 3),
    torch.ones(1, 3, 3) * 2,
    torch.ones(1, 3, 3) * 3
), dim=0).unsqueeze(0)
print(x.shape)

out = conv(x)
print(out.shape)

print(out)

I think the idea was introduced in the Xception paper, where the authors explain their reasoning.

Hi again, I’m still a bit lost. I analysed the code you gave me, but it still doesn’t answer a different question: why do depthwise convs use the same set of weights for all channels?

import torch.nn as nn

conv_1 = nn.Conv2d(3, 9, 3, 1, 1, bias=False, groups=1)
conv_2 = nn.Conv2d(3, 9, 3, 1, 1, bias=False, groups=3)
print(f'normal conv2d weights shape: {conv_1.weight.shape}')
print(f'depthwise separable conv weights shape: {conv_2.weight.shape}')

Answer:

normal conv2d weights shape: torch.Size([9, 3, 3, 3])
depthwise separable conv weights shape: torch.Size([9, 1, 3, 3])

This code clearly shows that a depthwise separable conv uses only one kernel for all input channels. Why can’t depthwise separable convs use a different kernel for every input channel?
Excuse me if I’m being too noisy, but I really want to grasp the whole idea behind depthwise convs.

It doesn’t, and each filter is used for a particular input channel in my example.
If you check the weight, input, and output, you’ll see:

filters:
Parameter containing:
tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]]],


        [[[2., 2., 2.],
          [2., 2., 2.],
          [2., 2., 2.]]],


        [[[3., 3., 3.],
          [3., 3., 3.],
          [3., 3., 3.]]]], requires_grad=True)

input:
tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]],

         [[2., 2., 2.],
          [2., 2., 2.],
          [2., 2., 2.]],

         [[3., 3., 3.],
          [3., 3., 3.],
          [3., 3., 3.]]]])

output:
tensor([[[[ 4.,  6.,  4.],
          [ 6.,  9.,  6.],
          [ 4.,  6.,  4.]],

         [[16., 24., 16.],
          [24., 36., 24.],
          [16., 24., 16.]],

         [[36., 54., 36.],
          [54., 81., 54.],
          [36., 54., 36.]]]], grad_fn=<MkldnnConvolutionBackward>)

Based on the output you can see that only filter0 is applied to input_channel0, filter1 to input_channel1, and filter2 to input_channel2.
The center output value shows it most clearly, and you can recompute it manually.
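Recomputing those center values by hand (pure Python, mirroring the constant-valued example above):

```python
# Each input channel is a 3x3 map of one constant value, and each filter is a
# 3x3 kernel of one constant value. At the center position (with padding=1)
# the kernel fully overlaps the input, so the output is 9 * kernel * input.
for kernel_val, input_val in [(1, 1), (2, 2), (3, 3)]:
    center = sum(kernel_val * input_val for _ in range(9))
    print(center)  # 9, 36, 81 -- matching the three center values above
```

If the filters were instead applied across all channels (as in a standard conv), every center value would be 1*1*9 + 2*2*9 + 3*3*9 = 126, which is not what the output shows.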

I think I got it, but I just want to be sure. Below I declared one conv, which is a depthwise separable conv.

conv_1 = nn.Conv2d(3, 9, 3, 1, 1, groups=3)

The conv_1 weights are:

tensor([[[[-0.0161,  0.0004, -0.1099],
          [ 0.0736, -0.1180, -0.3216],
          [-0.0853,  0.0933, -0.1864]]],

        [[[ 0.1278, -0.1644,  0.2590],
          [-0.1445, -0.1871, -0.0343],
          [-0.0603,  0.3219, -0.0483]]],

        [[[-0.3080,  0.2339,  0.2711],
          [ 0.0146,  0.1944, -0.1174],
          [-0.1286,  0.0212, -0.0767]]],

        [[[ 0.1212, -0.1680,  0.0872],
          [-0.0935, -0.1477, -0.1296],
          [ 0.2756,  0.1593, -0.3307]]],

        [[[-0.2472, -0.2188,  0.0018],
          [-0.2081, -0.1654,  0.2874],
          [ 0.0344,  0.2792,  0.0798]]],

        [[[-0.2411,  0.3194, -0.1092],
          [ 0.1661, -0.1743, -0.1464],
          [ 0.2462,  0.2495,  0.2859]]],

        [[[ 0.2739, -0.3026, -0.1970],
          [-0.2544, -0.1160, -0.0240],
          [-0.0554,  0.0301,  0.2732]]],

        [[[-0.0789, -0.2924, -0.1171],
          [-0.0089, -0.1265, -0.3290],
          [ 0.2165,  0.3325,  0.0016]]],

        [[[-0.3004, -0.2152, -0.3310],
          [ 0.0623,  0.2407, -0.0975],
          [-0.0268,  0.3014,  0.3107]]]], requires_grad=True)

So the first 3 blocks of weights, specifically:

        [[[-0.0161,  0.0004, -0.1099],
          [ 0.0736, -0.1180, -0.3216],
          [-0.0853,  0.0933, -0.1864]]],

        [[[ 0.1278, -0.1644,  0.2590],
          [-0.1445, -0.1871, -0.0343],
          [-0.0603,  0.3219, -0.0483]]],

        [[[-0.3080,  0.2339,  0.2711],
          [ 0.0146,  0.1944, -0.1174],
          [-0.1286,  0.0212, -0.0767]]],

would be convolved with the input, then the following 3 blocks, and then the last 3? Do I understand it correctly now?

Yes, the first three filters would be used for the first input channel, then the second 3 filters for the second input channel, etc.
You can always manually verify it:

import torch
import torch.nn as nn
import torch.nn.functional as F

conv_1 = nn.Conv2d(3, 9, 3, 1, 1, groups=3, bias=False)
x = torch.randn(1, 3, 4, 4)
out_ref = conv_1(x)

x0, x1, x2 = x.split(1, dim=1)

out0 = F.conv2d(x0, conv_1.weight[0:1], stride=1, padding=1)
out1 = F.conv2d(x0, conv_1.weight[1:2], stride=1, padding=1)
out2 = F.conv2d(x0, conv_1.weight[2:3], stride=1, padding=1)

out3 = F.conv2d(x1, conv_1.weight[3:4], stride=1, padding=1)
out4 = F.conv2d(x1, conv_1.weight[4:5], stride=1, padding=1)
out5 = F.conv2d(x1, conv_1.weight[5:6], stride=1, padding=1)

out6 = F.conv2d(x2, conv_1.weight[6:7], stride=1, padding=1)
out7 = F.conv2d(x2, conv_1.weight[7:8], stride=1, padding=1)
out8 = F.conv2d(x2, conv_1.weight[8:9], stride=1, padding=1)

out = torch.cat((out0, out1, out2, out3, out4, out5, out6, out7, out8), dim=1)
print((out_ref-out).abs().max())
> tensor(1.1921e-07, grad_fn=<MaxBackward1>)

If I’m not mistaken, a “depthwise-separable convolution” means a depthwise conv with in_channels=out_channels=groups followed by another convolution with a 1x1 kernel, so I would recommend checking the previously posted reference paper.
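As a sketch, the two steps can be chained like this (the channel sizes 32 and 64 are my own choice for illustration, not from the paper):

```python
import torch
import torch.nn as nn

# Depthwise step: one 3x3 filter per input channel (groups=in_channels)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32, bias=False)
# Pointwise step: a standard 1x1 conv that mixes information across channels
pointwise = nn.Conv2d(32, 64, kernel_size=1, bias=False)
separable = nn.Sequential(depthwise, pointwise)

x = torch.randn(1, 32, 8, 8)
print(separable(x).shape)      # torch.Size([1, 64, 8, 8])
print(depthwise.weight.shape)  # torch.Size([32, 1, 3, 3])
print(pointwise.weight.shape)  # torch.Size([64, 32, 1, 1])
```

The depthwise step filters each channel spatially on its own, and the pointwise step then recombines the channels, which is where the cross-channel mixing of a standard conv is recovered.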


Yes, I knew that this is a sort of “2nd step”, so I didn’t mention it because this part didn’t confuse me.
Thank you for taking the time to explain to me how to understand depthwise conv nets.