Why is grouped convolution slower than basic convolution?

kiradiso · December 10, 2019, 5:15am

I have added the following module to all blocks of res_layer4. When groups = 1 (in self.tim), the forward pass takes around 1e-3 s; when groups = channel (channel-wise convolution), it takes around 3e-2 s.
So why is it so much slower than basic convolution, even though it has only 1/channel of the weights?

import torch.nn as nn

# MEM3D and SELayer3D are attention modules defined elsewhere in the project.
class TEI(nn.Module):
    def __init__(self, channel, att_type='mem', reduction=8, bn=False):
        super(TEI, self).__init__()
        assert att_type in ['mem', 'se', 'none']
        self.attention = None
        # Compare strings with ==; `is` tests object identity, not equality.
        if att_type == 'mem':
            self.attention = MEM3D(channel, reduction)
        elif att_type == 'se':
            self.attention = SELayer3D(channel, reduction)
        # 3 x 1 x 1 3D conv; set groups=channel for the channel-wise (CW) variant
        if bn:
            self.tim = nn.Sequential(nn.Conv3d(channel, channel, kernel_size=(3, 1, 1), padding=(1, 0, 0), groups=1), nn.BatchNorm3d(channel))
        else:
            self.tim = nn.Conv3d(channel, channel, kernel_size=(3, 1, 1), padding=(1, 0, 0), groups=1)

    def forward(self, x):
        if self.attention is not None:
            x = self.attention(x)
        x = self.tim(x)
        return x

ptrblck · December 10, 2019, 5:18am

I assume you've synchronized the code to get a valid profiling result, if you are running on the GPU.
Anyway, the most likely reason is that some convolution setups are highly optimized while others are not.
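
For reference, here is a minimal sketch of a synchronized timing loop. It is not the original poster's benchmark: the helper name time_conv3d, the default channel count, and the input shape are assumptions chosen for illustration. It calls torch.cuda.synchronize() before starting and before stopping the timer, because CUDA kernels launch asynchronously, and it falls back to the CPU when no GPU is available.

```python
import time

import torch
import torch.nn as nn


def time_conv3d(groups, channels=64, shape=(2, 8, 14, 14), iters=10, device=None):
    """Return the mean forward time in seconds of a (3, 1, 1) Conv3d.

    `channels` and `shape` (batch, T, H, W) are illustrative defaults,
    not the sizes used in the original post.
    """
    device = torch.device(device or ("cuda" if torch.cuda.is_available() else "cpu"))
    conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                     padding=(1, 0, 0), groups=groups).to(device)
    n, t, h, w = shape
    x = torch.randn(n, channels, t, h, w, device=device)

    with torch.no_grad():
        for _ in range(3):  # warm-up, so one-time setup cost is not measured
            conv(x)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for the warm-up kernels to finish
        start = time.perf_counter()
        for _ in range(iters):
            conv(x)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for the timed kernels to finish
        elapsed = time.perf_counter() - start
    return elapsed / iters


if __name__ == "__main__":
    for g in (1, 8, 64):
        print(f"groups={g:3d}: {time_conv3d(g) * 1e3:.3f} ms")
```

The absolute numbers depend on the device, the PyTorch and cuDNN versions, and the tensor sizes, so only compare results measured in the same environment.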

kiradiso · December 10, 2019, 6:48am

Thanks a lot. Maybe groups = channel decreases the computational intensity of the model. When I set groups = 8, it is faster than both.
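
To make the intensity argument concrete, here is a rough back-of-the-envelope sketch. The function conv3d_cost and the example sizes (512 channels, an 8 x 7 x 7 output) are assumptions for illustration, not values from the model above. It counts the weights and multiply-accumulates (MACs) of a bias-free grouped Conv3d and divides the MACs by the number of tensor elements touched, assuming stride 1 and padding that keeps the output the same size as the input.

```python
def conv3d_cost(in_ch, out_ch, kernel, out_shape, groups=1):
    """Estimate weights, MACs per sample, and MACs per element touched.

    Assumes a bias-free Conv3d whose output has the same (T, H, W) as its
    input, as with kernel (3, 1, 1), padding (1, 0, 0), and stride 1.
    """
    assert in_ch % groups == 0 and out_ch % groups == 0
    kt, kh, kw = kernel
    t, h, w = out_shape
    positions = t * h * w

    # Each output channel only sees in_ch // groups input channels.
    weights = out_ch * (in_ch // groups) * kt * kh * kw
    # Every output position reuses the full filter once.
    macs = weights * positions
    # Elements read or written: input, output, and weights.
    elements = in_ch * positions + out_ch * positions + weights
    return weights, macs, macs / elements


if __name__ == "__main__":
    for g in (1, 8, 512):
        weights, macs, intensity = conv3d_cost(512, 512, (3, 1, 1), (8, 7, 7), g)
        print(f"groups={g:3d}: weights={weights}, MACs={macs}, "
              f"MACs/element={intensity:.1f}")
```

With groups = channel there is very little arithmetic per element moved, so the layer tends to be limited by memory traffic rather than by compute. That is one plausible reason why cutting the weights by a factor of channel does not cut the run time by the same factor.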

Eta_C · December 10, 2019, 6:57am

@ptrblck
Since groups == channels is faster than group convolution and basic convolution, I want to know something else.

If I set out_channels == groups == channels, it becomes a depthwise convolution. Is depthwise convolution also highly optimized, or is the difference just due to the reduced computation?

ptrblck · December 10, 2019, 5:49pm

I’ve profiled the cuDNN depthwise convs for FP16 in this PR and used a heuristic to choose based on the workload.
Unfortunately, the answer is again “it depends” in this case.