3D depthwise separable convolution

Hi all,

I'm trying to implement a depthwise separable convolution, as described in the Xception paper, for 3D input data (batch size, channels, x, y, z). Is the following class correct, or am I missing something?

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
  def __init__(self, in_channels, out_channels):
    super(DepthwiseSeparableConv, self).__init__()
    # Depthwise: one 3x3x3 filter per input channel (groups=in_channels)
    self.depthwise = nn.Conv3d(in_channels=in_channels, out_channels=in_channels,
                               kernel_size=3, stride=1, padding=1, dilation=1,
                               groups=in_channels, bias=False, padding_mode='zeros')
    # Pointwise: 1x1x1 convolution that mixes the channels
    self.pointwise = nn.Conv3d(in_channels=in_channels, out_channels=out_channels,
                               kernel_size=1, stride=1, padding=0, dilation=1,
                               groups=1, bias=False)
    self.bn = nn.BatchNorm3d(num_features=out_channels)

  def forward(self, x):
    out = self.depthwise(x)
    out = self.pointwise(out)
    return self.bn(out)
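
As a quick sanity check, the module runs on 5D input; the shapes below are just illustrative:

import torch

# Illustrative shapes: batch of 2, 4 channels, a 16x16x16 volume
conv = DepthwiseSeparableConv(in_channels=4, out_channels=8)
x = torch.randn(2, 4, 16, 16, 16)
print(conv(x).shape)  # torch.Size([2, 8, 16, 16, 16])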

I think it might be necessary to additionally add a parameter for the number of kernels, according to this example. But what would the number of kernels be then? I could not find anything about this in the paper.


Hi, a depthwise separable convolution is used instead of an ordinary convolution. In your code you just take an ordinary convolution using Conv3d plus a pointwise convolution. The unique feature of a depthwise convolution is that it works on the channels separately (1 filter = 1 input channel), whereas an ordinary convolution processes all channels together (1 filter = information from all input channels).
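
To make the distinction concrete, here is a sketch (the channel count is illustrative) comparing the parameter counts of an ordinary Conv3d and a depthwise Conv3d:

import torch.nn as nn

in_ch = 16
ordinary = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1,
                      groups=in_ch, bias=False)

# Ordinary: each of the 16 filters sees all 16 input channels
print(sum(p.numel() for p in ordinary.parameters()))   # 16*16*3*3*3 = 6912
# Depthwise: each filter sees exactly one input channel
print(sum(p.numel() for p in depthwise.parameters()))  # 16*1*3*3*3 = 432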

Ok, I’ve explored some other code and now I’m not so sure about my commentary…

When reading the paper, note:

“A [2d] convolution layer attempts to learn filters in a 3D space, with 2 spatial dimensions (width and height) and a channel dimension; thus a single convolution kernel is tasked with simultaneously mapping cross-channel correlations and spatial correlations.”
emphasis mine

They just mean a Conv2d with a 3x3 kernel and stride of 1, followed by a 1x1 kernel with stride of 1.

[image: a simplified Inception module with the channels split into 3 segments]

In the above case, you’d set the groups argument to 3. For example:

# in_channels and out_channels must both be divisible by groups=3
simp_inception = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, (3, 3), padding=1, groups=3),
    nn.Conv2d(out_channels, out_channels, (1, 1), groups=3))

There are multiple variations in the same paper, but that is the general gist of it.
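
For comparison, the "extreme" version of the Inception module that the paper builds on applies a 1x1 convolution first and then a separate spatial convolution for every output channel. A minimal sketch of that variant (the channel counts are illustrative):

import torch.nn as nn

in_channels, out_channels = 32, 64  # illustrative values
extreme_inception = nn.Sequential(
    # 1x1 pointwise convolution mapping cross-channel correlations
    nn.Conv2d(in_channels, out_channels, (1, 1)),
    # one 3x3 filter per output channel (groups=out_channels)
    nn.Conv2d(out_channels, out_channels, (3, 3), padding=1,
              groups=out_channels))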

Lastly, I would also like to bring to your attention that PyTorch now has a native Inception v3 model in torchvision.
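
For example, assuming torchvision is installed:

import torchvision.models as models

# Built-in Inception v3 architecture from torchvision
model = models.inception_v3()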