Using optimised depthwise convolutions

I think the problem might be memory bandwidth.

If you implement depthwise separable convolution with something like this:

import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 padding=0, dilation=1, bias=False):
        super().__init__()
        # Depthwise step: groups=in_channels gives one filter per input channel
        self.conv1 = nn.Conv2d(in_channels, in_channels, kernel_size, stride,
                               padding, dilation, groups=in_channels, bias=bias)
        # Pointwise step: a 1x1 convolution that mixes channels into out_channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, 1, 0, 1, 1, bias=bias)

    def forward(self, x):
        x = self.conv1(x)      # intermediate tensor is written out to memory
        x = self.pointwise(x)  # then read back in for the channel-mixing pass
        return x

then the layer ends up writing roughly twice as much data to memory as a typical conv2d: the full intermediate tensor produced by the depthwise pass has to be written out to VRAM and read back in by the pointwise pass before the final output is written.
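As a rough way to see this (my own illustrative sketch, with arbitrary tensor sizes and a CUDA device assumed; peak allocation is only a proxy for the extra write traffic), you can compare peak allocated memory for the two-step module against a plain convolution:

import torch

# Illustrative comparison: the intermediate tensor from the depthwise pass
# shows up as extra peak memory for the separable version.
x = torch.randn(8, 256, 128, 128, device="cuda")
sep = SeparableConv2d(256, 256, kernel_size=3, padding=1).cuda()
plain = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False).cuda()

with torch.no_grad():
    torch.cuda.reset_peak_memory_stats()
    out = sep(x)
    print("separable peak MB:", torch.cuda.max_memory_allocated() / 2**20)
    del out

    torch.cuda.reset_peak_memory_stats()
    out = plain(x)
    print("plain conv peak MB:", torch.cuda.max_memory_allocated() / 2**20)

The exact numbers will vary with cuDNN workspace choices, but the separable version should show the extra intermediate allocation on top of the input and output.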

This extra memory traffic shouldn't be necessary, though. If the two steps were written as a single fused CUDA kernel, each output pixel could be computed in one pass: the depthwise result for that pixel stays in registers and is consumed immediately by the pointwise weights, so only the final result needs to be written to VRAM. Since the operation is largely memory-bound, this would likely give a large speedup.
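Here is a pure-PyTorch sketch of that idea for a single output pixel (the function name, shapes, and check below are my own illustration and deliberately not a fast implementation): the depthwise window result lives in a small local buffer and is consumed immediately by the pointwise weights, so nothing but the final per-pixel output values would ever need to be stored.

import torch
import torch.nn.functional as F

def fused_separable_pixel(x, dw_weight, pw_weight, i, j, pad=1):
    # Computes one output pixel of depthwise + pointwise in a single pass.
    # Assumes stride=1, dilation=1, no bias.
    # x: (C_in, H, W); dw_weight: (C_in, 1, k, k); pw_weight: (C_out, C_in, 1, 1)
    k = dw_weight.shape[-1]
    xp = F.pad(x, (pad, pad, pad, pad))
    patch = xp[:, i:i + k, j:j + k]                 # local (C_in, k, k) window
    dw = (patch * dw_weight[:, 0]).sum(dim=(1, 2))  # depthwise result, kept "in registers"
    return pw_weight[:, :, 0, 0] @ dw               # pointwise mix -> final (C_out,) values

# Check against the two-step module on a random input (shapes are arbitrary).
m = SeparableConv2d(16, 32, kernel_size=3, padding=1)
x = torch.randn(16, 20, 20)
ref = m(x.unsqueeze(0))[0]
out = fused_separable_pixel(x, m.conv1.weight, m.pointwise.weight, 5, 7)
print(torch.allclose(out, ref[:, 5, 7], atol=1e-5))

A real fused CUDA (or Triton) kernel would do this per thread across all pixels; the point of the sketch is only that the full intermediate tensor never has to be materialised.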