I think the problem might be memory bandwidth.
If you implement depthwise separable convolution something like this:
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0, dilation=1, bias=False):
        super(SeparableConv2d, self).__init__()
        # Depthwise: one filter per input channel (groups=in_channels)
        self.conv1 = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, dilation, groups=in_channels, bias=bias)
        # Pointwise: 1x1 convolution that mixes channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, 1, 0, 1, 1, bias=bias)

    def forward(self, x):
        x = self.conv1(x)        # intermediate tensor is written out to memory here
        x = self.pointwise(x)
        return x
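For reference, here is a quick shape check (the channel counts and feature-map size below are just illustrative assumptions, not anything from a real model). It shows that the depthwise stage materializes a full intermediate tensor the same size as its input, which has to land in VRAM before the pointwise stage can read it back:

import torch

# Arbitrary example sizes, chosen only to illustrate the point.
layer = SeparableConv2d(256, 256, kernel_size=3, padding=1)
x = torch.randn(1, 256, 56, 56)

depthwise_out = layer.conv1(x)                 # intermediate tensor, written to VRAM
final_out = layer.pointwise(depthwise_out)     # reads it back, writes the final output

print(depthwise_out.shape)   # torch.Size([1, 256, 56, 56]) -- as large as the output
print(final_out.shape)       # torch.Size([1, 256, 56, 56])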
Then your layer ends up writing roughly twice as much data to memory as a typical conv2d: the depthwise output is written out to VRAM in full and read back in by the pointwise convolution, on top of writing the final result.
This extra traffic shouldn't be necessary, though. If the operation were written as a single fused CUDA kernel, the depthwise and pointwise steps for each output pixel could be computed together on-chip, and only the final result would need to be written to VRAM. That would likely give a large speedup for this operation whenever it is memory-bound.
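A rough back-of-the-envelope estimate of the write traffic, using the same illustrative sizes as above (fp32, batch size 1, 256 channels in and out, 56x56 feature map; all numbers are assumptions, not measurements):

# Illustrative memory-traffic estimate for the two-kernel vs. a hypothetical fused kernel.
bytes_per_elem = 4                                  # fp32
n, c_in, c_out, h, w = 1, 256, 256, 56, 56

intermediate = n * c_in * h * w * bytes_per_elem    # depthwise output: written, then read back
final = n * c_out * h * w * bytes_per_elem          # pointwise output

two_kernel_writes = intermediate + final            # current two-layer implementation
fused_kernel_writes = final                         # fused kernel: only the final result

print(two_kernel_writes / 2**20, "MiB written (two kernels)")   # ~6.1 MiB
print(fused_kernel_writes / 2**20, "MiB written (fused)")       # ~3.1 MiB

With equal input and output channel counts the two-kernel version writes about twice as many bytes, which is where the "twice as much memory" figure above comes from.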