Decomposing grouped convolutions doesn't result in a speed-up

Let’s say I have a model with convolutional layers that look like this:

conv_full = nn.Conv2d(in_channels, out_channels, kernel_size=(5, 5), padding=2, stride=1, bias=True)

In order to speed up the model, one can try decomposing these convolutional layers into two consecutive layers with kernel sizes (5, 1) and (1, 5), respectively.

So each layer would look somewhat like this:

conv_decomposed = nn.Sequential(
  nn.Conv2d(in_channels, in_channels, kernel_size=(5, 1), padding=(2, 0), stride=(1, 1), bias=True),
  nn.Conv2d(in_channels, out_channels, kernel_size=(1, 5), padding=(0, 2), stride=(1, 1), bias=True)
)

And it actually speeds up the model: I get almost a 1.5x speed-up from the decomposition.
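For reference, here is a minimal timing sketch of the kind of comparison described above. The channel and spatial sizes are hypothetical choices for illustration; the exact numbers will vary with hardware and tensor shapes:

```python
import time
import torch
import torch.nn as nn

def bench(module, x, iters=20):
    # Warm up, then average the forward-pass time under no_grad.
    with torch.no_grad():
        for _ in range(3):
            module(x)
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
    return (time.perf_counter() - start) / iters

in_channels, out_channels = 32, 64  # hypothetical sizes
x = torch.randn(1, in_channels, 128, 128)

conv_full = nn.Conv2d(in_channels, out_channels, kernel_size=(5, 5),
                      padding=2, stride=1, bias=True)
conv_decomposed = nn.Sequential(
    nn.Conv2d(in_channels, in_channels, kernel_size=(5, 1),
              padding=(2, 0), stride=(1, 1), bias=True),
    nn.Conv2d(in_channels, out_channels, kernel_size=(1, 5),
              padding=(0, 2), stride=(1, 1), bias=True),
)

t_full = bench(conv_full, x)
t_dec = bench(conv_decomposed, x)
print(f"full: {t_full * 1e3:.2f} ms, decomposed: {t_dec * 1e3:.2f} ms")
```

Both modules produce outputs of the same shape, so they are drop-in replacements for each other as far as the rest of the model is concerned.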

But when I introduce a groups argument (equal to in_channels) in every layer, the decomposed version of this grouped convolution runs slightly slower than the regular (5x5) grouped convolution.

The regular one looks like this:

conv_full = nn.Conv2d(in_channels, out_channels, kernel_size=(5, 5), padding=2, stride=1, bias=True, groups=in_channels)

And the decomposed one like this:

conv_decomposed = nn.Sequential(
  nn.Conv2d(in_channels, in_channels, kernel_size=(5, 1), padding=(2, 0), stride=(1, 1), bias=True, groups=in_channels),
  nn.Conv2d(in_channels, out_channels, kernel_size=(1, 5), padding=(0, 2), stride=(1, 1), bias=True, groups=in_channels)
)

This seems counter-intuitive and is not what I expected, since the number of operations is still lower for the decomposed layers.
Is it supposed to work like this, and what might be the reason for this behaviour?
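To make "the number of operations is still lower" concrete, here is a back-of-the-envelope multiply-accumulate (MAC) count for the grouped case. The channel and spatial sizes are hypothetical, chosen only for illustration:

```python
# Hypothetical sizes just for illustration.
C_in, C_out, H, W = 32, 32, 128, 128

# Grouped 5x5 conv with groups=C_in: each output channel reads a
# single input channel through a 5x5 kernel -> 25 MACs per pixel.
macs_full = C_out * 25 * H * W

# Decomposed grouped pair: a (5, 1) conv over C_in channels followed
# by a (1, 5) conv over C_out channels -> 5 MACs per pixel each.
macs_decomposed = (C_in * 5 + C_out * 5) * H * W

print(macs_full, macs_decomposed, macs_full / macs_decomposed)
```

With these sizes the decomposed stack needs 2.5x fewer MACs, which is why the slowdown is surprising: the cost must be coming from per-kernel overhead rather than arithmetic.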

Grouped convolutions (at least in the past; I can't speak to the current state) lacked the level of optimization that regular convolutions received.


I forgot to mention that the grouped convolutions in my example run faster than the regular ones, both in the decomposed and non-decomposed cases.
But anyway, is the reasonable explanation the low-level implementation of the convolutions? I should also note that I got these results on CPU, not GPU. Can it depend on the hardware it runs on?
Can it also depend on the framework?

Primarily it depends on the kernel implementation, but hardware can affect it too. For example, MKL-DNN is developed by Intel, so one would expect better performance on Intel CPUs. You can use torch.profiler to estimate the performance difference.
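A minimal torch.profiler sketch for this kind of comparison might look like the following; the sizes are hypothetical, and the per-operator table shows which low-level kernels the time is actually spent in:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical sizes just for illustration.
in_channels = 32
x = torch.randn(1, in_channels, 128, 128)
conv = nn.Conv2d(in_channels, in_channels, kernel_size=(5, 5),
                 padding=2, groups=in_channels, bias=True)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        conv(x)

# The table lists per-operator CPU time, revealing which backend
# kernel (e.g. an MKL-DNN path) the convolution dispatched to.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(report)
```

Running the same profile over the full and decomposed variants lets you compare where each one spends its time, rather than just the end-to-end latency.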
