I have recently been studying the Xception paper and came across the depthwise separable convolution (DW conv). I think I understand how it works and how it is implemented in PyTorch, but I don’t understand the number of parameters in that layer.
For example, I have a convolution layer (no bias) with in_channels = 16, out_channels = 32 and kernel_size = 3. A traditional convolution should have 16x32x3x3 = 4608 parameters, and a DW conv (which sets groups=in_channels in the PyTorch implementation) should have 16x3x3 + 16x1x1x32 = 656 parameters.
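To make the arithmetic explicit, this is a small sketch of how I am counting. It is plain Python with no PyTorch involved, and the variable names are just mine for illustration. It assumes a DW conv means one 3x3 filter per input channel followed by a 1x1 convolution that mixes the channels.

```python
in_channels, out_channels, kernel_size = 16, 32, 3

# Traditional conv: every output channel has a full in_channels x k x k filter.
standard = in_channels * out_channels * kernel_size * kernel_size

# Depthwise step (my assumption): one k x k filter per input channel.
depthwise = in_channels * kernel_size * kernel_size

# Pointwise step (my assumption): a 1x1 conv mapping in_channels to out_channels.
pointwise = in_channels * 1 * 1 * out_channels

print(standard)               # 4608
print(depthwise, pointwise)   # 144 512
print(depthwise + pointwise)  # 656
```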
I printed the parameters of each conv layer using the parameters() function. The count for the traditional conv layer matches, but the DW conv has a single weight of size 32x1x3x3, which differs from what I expected from the Xception paper. It appears that the second dimension (1 in this example) is in_channels / groups.
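Here is a minimal sketch that reproduces what I am seeing, assuming a recent PyTorch install. The helper count_params is my own name, not part of PyTorch.

```python
import torch.nn as nn

def count_params(module):
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())

# Traditional convolution, no bias.
standard = nn.Conv2d(16, 32, kernel_size=3, bias=False)

# The same layer with groups=in_channels, which is what I took to be the DW conv.
grouped = nn.Conv2d(16, 32, kernel_size=3, groups=16, bias=False)

print(tuple(standard.weight.shape), count_params(standard))  # (32, 16, 3, 3) 4608
print(tuple(grouped.weight.shape), count_params(grouped))    # (32, 1, 3, 3) 288
```

So the grouped layer reports 32x1x3x3 = 288 parameters, which matches neither the 4608 of the traditional conv nor the 656 I computed by hand.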
Can someone help explain how this is implemented, or did I do something wrong? Thank you very much for the help!