I'm looking for ways to speed up the following module:

import torch
import torch.nn as nn


class DSConv(nn.Module):
    """Depthwise separable convolution: one conv per input channel, then a 1x1 conv."""
    def __init__(self, in_channels, out_channels, kernel_size, padding=0, stride=1):
        super(DSConv, self).__init__()

        # One single-channel conv per input channel (nn.ModuleList so the parameters are registered)
        self.convs   = nn.ModuleList([nn.Conv2d(1, 1, kernel_size=kernel_size, padding=padding, stride=stride)
                                      for _ in range(in_channels)])
        # Pointwise 1x1 conv that mixes the per-channel outputs
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # Apply each conv to its own channel, then concatenate along the channel dim
        channels = [self.convs[idx](ch) for idx, ch in enumerate(x.split(1, dim=1))]
        return self.conv1x1(torch.cat(channels, dim=1))
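For context, this is roughly how I'm calling it (the batch size, channel counts, and spatial size here are just an example):

x = torch.randn(8, 16, 32, 32)   # hypothetical batch of 8, 16 channels, 32x32
layer = DSConv(in_channels=16, out_channels=32, kernel_size=3, padding=1)
out = layer(x)
print(out.shape)                 # torch.Size([8, 32, 32, 32])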

The mapping / list comprehension in the forward pass appears to be the bottleneck. Are there any best-practice ways to parallelize this so it runs faster?

At the moment, there isn't a way to speed up group convolutions (from the code, that's what seems to be going on here).
Hopefully the next version of cuDNN will have this implemented. (In general, convolutions on the GPU are notoriously hard to speed up further, because cuDNN already does a ridiculously good job.)

@smth, are you saying this is the preferred method in the case of independent subnets?

class Net(nn.Module):
    def __init__(self, num, embedding_dim):
        super().__init__()

        # list of independent DCNN nets
        self.sub_nets = nn.ModuleList([subnet(output_size=embedding_dim)
                                       for i in range(num)])

        ...

    def forward(self, x):

        # forward on the expert nets
        subset_outs = torch.stack([model(x) for model in self.sub_nets])

        ...

nn.Conv2d now takes a groups parameter (the post above is from 2017; things have improved since then). If you use the groups option, it should use cuDNN and be a bit faster.
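As an illustration (a rough sketch, not the poster's exact code): the per-channel loop in DSConv above can be collapsed into a single grouped convolution by passing groups=in_channels, which gives one filter per input channel and lets cuDNN launch it as one op instead of a Python-level loop.

class DSConvGrouped(nn.Module):
    """Same depthwise-separable pattern as DSConv, expressed with the groups argument."""
    def __init__(self, in_channels, out_channels, kernel_size, padding=0, stride=1):
        super().__init__()
        # groups=in_channels -> each input channel gets its own filter (depthwise conv),
        # replacing the list comprehension over channels with a single convolution call
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=kernel_size,
                                   padding=padding, stride=stride, groups=in_channels)
        # Pointwise 1x1 conv mixes the channels, as before
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

Since the depthwise step is one nn.Conv2d call, there is no per-channel Python loop in forward at all.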