Maybe I don’t understand something regarding depthwise separable convolutions, but if you set the argument *groups=input channels* in **nn.Conv2d**, you get only one kernel per filter, no matter how many input channels there are. To my understanding, a normal convolutional operation has as many unique kernels per filter as there are input channels. Doesn’t that make a depthwise convolutional filter lose some information compared to a normal **Conv2d**, which uses a different (unique) kernel for every input channel?

That’s not the case, as a “standard” convolution will use filters that cover all input channels by default. The `out_channels` argument defines the number of filters. CS231n - Convolutional Layers describes this in more detail.

Depthwise convolutions with `groups=in_channels` would use filters where each group is applied to one input channel only. Since the parameter count is reduced, you often apply a `1x1` “pointwise” convolution afterwards, which is applied in the “standard” way.
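As a small sketch of that pattern (the channel sizes are example values I picked, not from your question), a depthwise conv followed by a `1x1` pointwise conv could look like this:

```
import torch
import torch.nn as nn

# depthwise: one 2D kernel per input channel (groups=in_channels)
# pointwise: a standard 1x1 conv mixing the channels afterwards
in_channels, out_channels = 3, 16
depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1,
                      groups=in_channels, bias=False)
pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

x = torch.randn(1, in_channels, 8, 8)
out = pointwise(depthwise(x))
print(depthwise.weight.shape)  # torch.Size([3, 1, 3, 3]) - one kernel per channel
print(pointwise.weight.shape)  # torch.Size([16, 3, 1, 1])
print(out.shape)               # torch.Size([1, 16, 8, 8])
```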

Excuse me, that was my bad; my statement was wrong, and I updated my question as well. What I wanted to say in that sentence is:

To my understanding, a normal convolutional operation has as many unique kernels per filter as there are input channels.

Regarding this statement below:

Depthwise convolutions with `groups=in_channels` would use filters where each group is applied to one input channel only. Since the parameter count is reduced, you often apply a `1x1` “pointwise” convolution afterwards, which is applied in the “standard” way.

As I understand it, if `groups=in_channels`, then one filter has only one unique kernel. For example, let’s take that one kernel, with kernel size 3x3, and fill it with values all equal to 1. The same kernel would then be used on all the input channels (for example channel R, channel G, channel B). If it were a normal convolution, there would be a filter with a unique kernel for each of the RGB channels, for example: channel R would be convolved with a 3x3 kernel whose values are all 1, channel G would be convolved with a 3x3 kernel whose values are all 2, and channel B would be convolved with a kernel whose values are all 3. All three of these kernels would be part of one filter, and other filters would have their own 3 unique kernels.
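To double-check my example in code (a small sketch with the kernel values I described; the input size is my own choice), a standard conv filter with per-channel kernels of 1s, 2s, and 3s sums the per-channel responses into one output channel:

```
import torch
import torch.nn.functional as F

# one filter made of three 3x3 kernels: all 1s (R), all 2s (G), all 3s (B)
weight = torch.stack((
    torch.ones(3, 3),      # kernel for channel R
    torch.ones(3, 3) * 2,  # kernel for channel G
    torch.ones(3, 3) * 3,  # kernel for channel B
)).unsqueeze(0)            # shape [1, 3, 3, 3]: one filter, three kernels

x = torch.ones(1, 3, 5, 5)
out = F.conv2d(x, weight)  # no padding, so output is [1, 1, 3, 3]
print(out)
# every value is 9*1 + 9*2 + 9*3 = 54: the per-channel responses are summed
```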

What I don’t understand is why a depthwise separable convolution uses only one kernel per filter if `groups=in_channels`. Doesn’t the fact that the same kernel is used for all channels make the model lose a little bit of information, or is that the whole point of depthwise separable convolution, aside from the fact that it drastically reduces the computational cost?

To my understanding, a normal convolutional operation has as many unique kernels per filter as there are input channels.

“Kernels” and “filters” are used interchangeably and refer to the same object. If you refer to a “kernel” as a 2D kernel, which is stacked into a “filter”, then you are right, but note that this description is not commonly used in the deep learning area.
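As a quick sketch in code (the channel sizes are example values): a standard conv stores its weight as `[out_channels, in_channels, kH, kW]`, i.e. each “filter” is a stack of one 2D “kernel” per input channel:

```
import torch.nn as nn

# 8 filters, each a stack of 5 2D kernels (one per input channel)
conv = nn.Conv2d(in_channels=5, out_channels=8, kernel_size=3)
print(conv.weight.shape)  # torch.Size([8, 5, 3, 3]) = [filters, kernels per filter, kH, kW]
```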

As I understand it, if `groups=in_channels`, then one filter has only one unique kernel. For example, let’s take that one kernel, with kernel size 3x3, and fill it with values all equal to 1. The same kernel would then be used on all the input channels (for example channel R, channel G, channel B).

No, if `groups=in_channels`, each filter will be used for a separate input channel. The Guide to Convolution Arithmetic gives examples and visualizes it in a beautiful way. Here is additionally a code snippet you could play around with to check how the kernels are applied:

```
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 3, 3, 1, 1, bias=False, groups=3)
print(conv.weight.shape)
with torch.no_grad():
    conv.weight = nn.Parameter(
        torch.stack((
            torch.ones(1, 3, 3),
            torch.ones(1, 3, 3) * 2,
            torch.ones(1, 3, 3) * 3
        ))
    )
x = torch.cat((
    torch.ones(1, 3, 3),
    torch.ones(1, 3, 3) * 2,
    torch.ones(1, 3, 3) * 3
), dim=0).unsqueeze(0)
print(x.shape)
out = conv(x)
print(out.shape)
print(out)
```

What I don’t understand is why a depthwise separable convolution uses only one kernel per filter if `groups=in_channels`. Doesn’t the fact that the same kernel is used for all channels make the model lose a little bit of information, or is that the whole point of depthwise separable convolution, aside from the fact that it drastically reduces the computational cost?

I think the idea was introduced in the Xception paper, and the authors explain their ideas there.

Hi again, I’m still a bit lost. I analysed the code you gave me, but it still doesn’t answer a different question: why do depthwise convs use the same set of weights for all channels?

```
import torch.nn as nn

conv_1 = nn.Conv2d(3, 9, 3, 1, 1, bias=False, groups=1)
conv_2 = nn.Conv2d(3, 9, 3, 1, 1, bias=False, groups=3)
print(f'normal conv2d weights shape: {conv_1.weight.shape}')
print(f'depthwise separable conv weights shape: {conv_2.weight.shape}')
```

Answer:

```
normal conv2d weights shape: torch.Size([9, 3, 3, 3])
depthwise separable conv weights shape: torch.Size([9, 1, 3, 3])
```

This code clearly shows that depthwise separable convs use only one kernel for all input channels. Why can’t depthwise separable convs use a different kernel for every input channel?

Excuse me if I’m being too noisy, but I really want to get the whole idea behind depthwise convs.

why depthwise convs use the same set of weights for all channels.

It doesn’t; each filter is used for a particular input channel in my example.

If you check the weight, input, and output, you’ll see:

```
filters:
Parameter containing:
tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]]],

        [[[2., 2., 2.],
          [2., 2., 2.],
          [2., 2., 2.]]],

        [[[3., 3., 3.],
          [3., 3., 3.],
          [3., 3., 3.]]]], requires_grad=True)
input:
tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]],

         [[2., 2., 2.],
          [2., 2., 2.],
          [2., 2., 2.]],

         [[3., 3., 3.],
          [3., 3., 3.],
          [3., 3., 3.]]]])
output:
tensor([[[[ 4.,  6.,  4.],
          [ 6.,  9.,  6.],
          [ 4.,  6.,  4.]],

         [[16., 24., 16.],
          [24., 36., 24.],
          [16., 24., 16.]],

         [[36., 54., 36.],
          [54., 81., 54.],
          [36., 54., 36.]]]], grad_fn=<MkldnnConvolutionBackward>)
```

Based on the output you can see that only filter0 is applied to input_channel0, filter1 to input_channel1, and filter2 to input_channel2.

The center output value indicates it most clearly and you can manually recompute it.
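For example, the center values could be recomputed manually like this (a small sketch restating the values from the snippet above):

```
import torch

# Each filter sees only its own input channel, so the center output value is
# just the elementwise product of the 3x3 kernel and the 3x3 input window, summed.
centers = []
for k in (1., 2., 3.):
    kernel = torch.ones(3, 3) * k   # filter for channel k (all values equal to k)
    window = torch.ones(3, 3) * k   # 3x3 input window of channel k at the center
    centers.append((kernel * window).sum().item())
print(centers)  # [9.0, 36.0, 81.0] - the center values of the three output channels
```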

Arturas_Druteika: why depthwise convs use the same set of weights for all channels.

It doesn’t; each filter is used for a particular input channel in my example.

I think I got it, but I just wanna be sure. Below I declared one conv, which is a depthwise separable conv.

```
conv_1 = nn.Conv2d(3, 9, 3, 1, 1, groups=3)
```

The **conv_1** weights are:

```
tensor([[[[-0.0161,  0.0004, -0.1099],
          [ 0.0736, -0.1180, -0.3216],
          [-0.0853,  0.0933, -0.1864]]],

        [[[ 0.1278, -0.1644,  0.2590],
          [-0.1445, -0.1871, -0.0343],
          [-0.0603,  0.3219, -0.0483]]],

        [[[-0.3080,  0.2339,  0.2711],
          [ 0.0146,  0.1944, -0.1174],
          [-0.1286,  0.0212, -0.0767]]],

        [[[ 0.1212, -0.1680,  0.0872],
          [-0.0935, -0.1477, -0.1296],
          [ 0.2756,  0.1593, -0.3307]]],

        [[[-0.2472, -0.2188,  0.0018],
          [-0.2081, -0.1654,  0.2874],
          [ 0.0344,  0.2792,  0.0798]]],

        [[[-0.2411,  0.3194, -0.1092],
          [ 0.1661, -0.1743, -0.1464],
          [ 0.2462,  0.2495,  0.2859]]],

        [[[ 0.2739, -0.3026, -0.1970],
          [-0.2544, -0.1160, -0.0240],
          [-0.0554,  0.0301,  0.2732]]],

        [[[-0.0789, -0.2924, -0.1171],
          [-0.0089, -0.1265, -0.3290],
          [ 0.2165,  0.3325,  0.0016]]],

        [[[-0.3004, -0.2152, -0.3310],
          [ 0.0623,  0.2407, -0.0975],
          [-0.0268,  0.3014,  0.3107]]]], requires_grad=True)
```

So the first 3 blocks of weights, specifically:

```
[[[-0.0161,  0.0004, -0.1099],
  [ 0.0736, -0.1180, -0.3216],
  [-0.0853,  0.0933, -0.1864]]],

[[[ 0.1278, -0.1644,  0.2590],
  [-0.1445, -0.1871, -0.0343],
  [-0.0603,  0.3219, -0.0483]]],

[[[-0.3080,  0.2339,  0.2711],
  [ 0.0146,  0.1944, -0.1174],
  [-0.1286,  0.0212, -0.0767]]],
```

would be convolved with the input, then the following 3 blocks, and then the last 3? Do I now understand it the correct way?

Yes, the first three filters would be used for the first input channel, then the second 3 filters for the second input channel, etc.

You can always manually verify it:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

conv_1 = nn.Conv2d(3, 9, 3, 1, 1, groups=3, bias=False)
x = torch.randn(1, 3, 4, 4)
out_ref = conv_1(x)
x0, x1, x2 = x.split(1, dim=1)
out0 = F.conv2d(x0, conv_1.weight[0:1], stride=1, padding=1)
out1 = F.conv2d(x0, conv_1.weight[1:2], stride=1, padding=1)
out2 = F.conv2d(x0, conv_1.weight[2:3], stride=1, padding=1)
out3 = F.conv2d(x1, conv_1.weight[3:4], stride=1, padding=1)
out4 = F.conv2d(x1, conv_1.weight[4:5], stride=1, padding=1)
out5 = F.conv2d(x1, conv_1.weight[5:6], stride=1, padding=1)
out6 = F.conv2d(x2, conv_1.weight[6:7], stride=1, padding=1)
out7 = F.conv2d(x2, conv_1.weight[7:8], stride=1, padding=1)
out8 = F.conv2d(x2, conv_1.weight[8:9], stride=1, padding=1)
out = torch.cat((out0, out1, out2, out3, out4, out5, out6, out7, out8), dim=1)
print((out_ref - out).abs().max())
> tensor(1.1921e-07, grad_fn=<MaxBackward1>)
```

Below I declared 1 conv which is depthwise separable conv.

If I’m not mistaken, a “depthwise-separable convolution” would mean a depthwise conv with `in_channels=out_channels=groups` followed by another convolution with a `1x1` kernel, so I would recommend checking the previously posted reference paper.
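As a rough sketch of the parameter savings (the channel sizes are example values, not from this thread):

```
import torch.nn as nn

# standard 3x3 conv vs. depthwise (in_channels=out_channels=groups) + 1x1 pointwise
standard = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32, bias=False)
pointwise = nn.Conv2d(32, 64, kernel_size=1, bias=False)

n_standard = sum(p.numel() for p in standard.parameters())
n_separable = sum(p.numel() for p in depthwise.parameters()) + \
              sum(p.numel() for p in pointwise.parameters())
print(n_standard)   # 64 * 32 * 3 * 3 = 18432
print(n_separable)  # 32 * 1 * 3 * 3 + 64 * 32 * 1 * 1 = 2336
```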

If I’m not mistaken, a “depthwise-separable convolution” would mean a depthwise conv with `in_channels=out_channels=groups` followed by another convolution with a `1x1` kernel, so I would recommend checking the previously posted reference paper.

Yes, I knew that this is a sort of “2nd step”, so I didn’t mention it because this part didn’t confuse me.

Thank you for spending the time to explain to me how to understand depthwise convs.