Group convolution takes much longer than normal convolution


I have an issue with how long group convolution takes.

I compared it against a normal convolution with the same number of parameters, and the normal convolution is much faster.

Test code is as follows :

import time
import torch

# dataloader is given; each batch is a pair of tensors (a, b) of shape (32, 12, 50, 50)
loader = enumerate(dataloader)
# the layers must be on the GPU as well, since the input is moved there below
net1 = torch.nn.Conv2d(12, 300, kernel_size=5, stride=1, padding=(2, 2), bias=False).cuda()  # normal convolution
net2 = torch.nn.Conv2d(12, 12 * 300, groups=12, kernel_size=5, stride=1, padding=(2, 2), bias=False).cuda()  # group convolution
# compare normal conv and group conv
for i in range(3):
    _, (a, b) = next(loader)
    a = a.cuda()
    start = time.time()
    o = net1(a)  # line**
    end = time.time()
    print(f"iter {i} : {end - start}s")

I recorded the time taken by the convolution, then repeated the run with line** replaced by "o = net2(a)".
These are the timings with normal convolution:

iter 0 : 0.12s
iter 1 : 0.00536s
iter 2 : 0.00494s

Then I replaced net1 with net2, i.e. normal conv with group conv:

iter 0 : 0.2497s
iter 1 : 0.2162s
iter 2 : 0.20505s
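
One caveat on the measurement: CUDA kernels are launched asynchronously, so time.time() around a forward call may not capture the full kernel time unless torch.cuda.synchronize() is called before reading the clock. Below is a minimal sketch of a synchronized timing loop. The helper name time_forward and the random input standing in for a dataloader batch are assumptions for illustration, not part of the original test, and the batch is shrunk on CPU only to keep the demo quick.

```python
import time
import torch


def time_forward(net, x, n_iters=3):
    """Return per-iteration forward times in seconds (synchronized on CUDA)."""
    use_cuda = x.is_cuda
    times = []
    with torch.no_grad():
        for _ in range(n_iters):
            if use_cuda:
                torch.cuda.synchronize()  # finish any pending GPU work first
            start = time.perf_counter()
            net(x)
            if use_cuda:
                torch.cuda.synchronize()  # wait until the convolution has finished
            times.append(time.perf_counter() - start)
    return times


device = "cuda" if torch.cuda.is_available() else "cpu"
batch = 32 if device == "cuda" else 2  # assumption: smaller batch on CPU
x = torch.randn(batch, 12, 50, 50, device=device)  # stand-in for one batch

net1 = torch.nn.Conv2d(12, 300, kernel_size=5, padding=2, bias=False).to(device)
net2 = torch.nn.Conv2d(12, 12 * 300, kernel_size=5, padding=2,
                       groups=12, bias=False).to(device)

for name, net in [("normal", net1), ("group", net2)]:
    for i, t in enumerate(time_forward(net, x)):
        print(f"{name} conv, iter {i} : {t:.5f}s")
```

The first iteration usually includes one-time setup cost, so later iterations are the ones to compare.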

I believe net1 and net2 have the same number of weight parameters (12 × 300 × 5 × 5 = 90,000 each), and, given matching weights, they give the same result once the group convolution output is reshaped and summed over the input channels. However, the time cost is very different. In particular, the time for normal convolution drops dramatically after the first iteration, while the time for group convolution does not.
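
The equivalence described here can be checked numerically on a small input. The sketch below assumes the weights of the group convolution are copied from the normal one, with output channel g * 300 + j of the grouped layer holding the kernel that connects input channel g to output channel j. Double precision and the 10×10 input are choices made only to keep the check tight and fast.

```python
import torch

torch.manual_seed(0)
n_in, n_out, k = 12, 300, 5

net1 = torch.nn.Conv2d(n_in, n_out, kernel_size=k, padding=2, bias=False).double()
net2 = torch.nn.Conv2d(n_in, n_in * n_out, kernel_size=k, padding=2,
                       groups=n_in, bias=False).double()

# net1.weight has shape (n_out, n_in, k, k); net2.weight has shape (n_in * n_out, 1, k, k).
# Rearrange so that row g * n_out + j of net2 equals net1.weight[j, g].
with torch.no_grad():
    net2.weight.copy_(net1.weight.permute(1, 0, 2, 3).reshape(n_in * n_out, 1, k, k))

x = torch.randn(2, n_in, 10, 10, dtype=torch.float64)
with torch.no_grad():
    y1 = net1(x)
    # Split the grouped output into (input channel, output channel) and sum over inputs.
    y2 = net2(x).reshape(2, n_in, n_out, 10, 10).sum(dim=1)

same_params = net1.weight.numel() == net2.weight.numel()
print("same parameter count:", same_params)
print("outputs match:", torch.allclose(y1, y2))
```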

Can I run group convolution as fast as normal convolution?

I need group convolution to implement the following custom layer:
X1…Xn (Xi is the i-th input channel) --> Y1…Ym (Yj is the j-th output channel)
where Yj = \sum_i (Xi * Kij) @ (Xi * L)
Kij is a kernel, * is convolution, @ is element-wise multiplication, and L is another, fixed kernel.
To implement this, I used group convolution for the two convolutions, then multiplied the results element-wise and summed over the input channels. But it is too slow.
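
For concreteness, here is a minimal sketch of the layer as described, built from two grouped convolutions. The class name ProductConv, the weight initialization, and the assumption that Kij and L share the same odd kernel size are all hypothetical choices for illustration. Note that F.conv2d computes cross-correlation, as all PyTorch convolution layers do.

```python
import torch
import torch.nn.functional as F


class ProductConv(torch.nn.Module):  # hypothetical name
    """Y_j = sum_i (X_i * K_ij) @ (X_i * L), with L a fixed kernel."""

    def __init__(self, n_in, n_out, fixed_kernel):
        super().__init__()
        k = fixed_kernel.shape[-1]  # assumed odd, shared by K_ij and L
        self.n_in, self.n_out, self.pad = n_in, n_out, k // 2
        # Learnable kernels in grouped layout: row i * n_out + j holds K_ij.
        self.weight = torch.nn.Parameter(0.1 * torch.randn(n_in * n_out, 1, k, k))
        # Fixed kernel L, repeated per input channel for a depthwise convolution.
        self.register_buffer(
            "fixed", fixed_kernel.reshape(1, 1, k, k).repeat(n_in, 1, 1, 1))

    def forward(self, x):
        b, _, h, w = x.shape
        # X_i * K_ij for all (i, j): shape (b, n_in, n_out, h, w)
        xk = F.conv2d(x, self.weight, padding=self.pad, groups=self.n_in)
        xk = xk.reshape(b, self.n_in, self.n_out, h, w)
        # X_i * L for all i: shape (b, n_in, h, w)
        xl = F.conv2d(x, self.fixed, padding=self.pad, groups=self.n_in)
        # Element-wise product, then sum over the input-channel dimension.
        return (xk * xl.unsqueeze(2)).sum(dim=1)


layer = ProductConv(n_in=12, n_out=300, fixed_kernel=torch.randn(5, 5))
out = layer(torch.randn(2, 12, 50, 50))
print(out.shape)
```

Because Xi * L does not depend on j, it is computed once per input channel and broadcast across the output channels rather than being recomputed for each Kij.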

Yes, it’s very slow. This has been reported many times in GitHub issues, but there haven’t been any fixes, and I don’t know whether anybody is working on it. For example:
Group convolution slower than manually running separate convolutions in CUDA streams · Issue #73764 · pytorch/pytorch · GitHub,
FP32 depthwise convolution is slow in GPU · Issue #18631 · pytorch/pytorch · GitHub,
Training grouped Conv2D is slow · Issue #70954 · pytorch/pytorch · GitHub