No Speedup with Depthwise Convolutions

I was experimenting with depthwise convolutions and noticed that I’m not seeing any performance increase over standard convolutions. I tried a few different MobileNet architectures to look into this, but for the sake of repeatability I’ll reference this script which is a basic implementation of a MobileNet model:

If I change the script to use groups=1, my runtime for a forward pass does not change at all, neither faster nor slower. GPU runtime on a forward pass is ~15ms and CPU runtime on a forward pass is ~250ms.

OS: Windows 10
GPU: GTX 1080
PyTorch: 1.0.1 (previously had 1.0.0 but upgraded to see if it made a difference)
CUDA: 9.0
cudnn: 7.4.1 (previously had 7.0.4 but upgraded to see if it made a difference)

I’m not sure if groups has anything to do with Depthwise Convolutions

After changing groups=1, what kind of increase/decrease would you expect?

I would expect the execution to be slower when groups=1 (or more specifically, I would expect it to be faster when groups is equal to the number of input channels). The nn.Conv2d docs page claims that is how you use depthwise convolutions in PyTorch.
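As a minimal sketch of what the docs describe (the channel count and input size here are arbitrary illustration values), setting groups equal to in_channels gives each input channel its own filter, which shows up in the weight shape:

```python
import torch
import torch.nn as nn

channels = 32  # arbitrary, for illustration only

# Standard convolution: every output channel sees every input channel.
standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=1, bias=False)

# Depthwise convolution: groups == in_channels, so each channel is filtered on its own.
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels, bias=False)

x = torch.randn(1, channels, 16, 16)
print(standard.weight.shape)   # torch.Size([32, 32, 3, 3])
print(depthwise.weight.shape)  # torch.Size([32, 1, 3, 3])
print(standard(x).shape == depthwise(x).shape)  # True, both layers keep the output shape
```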

How much slower are you expecting though?
The speed could be affected by other factors, such as other layers or the batch size, so that the difference is insignificant. Maybe it was 5% slower and you didn’t measure it correctly.

It should be much faster to use depthwise convolutions if I’m implementing it properly. See this GitHub issue for example. Any time the number of groups is set equal to the number of input channels, that layer executes 10-100x faster. That should be apparent even when using a simple timing mechanism such as time.time().
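One caveat with time.time() on the GPU: CUDA kernels are launched asynchronously, so the clock can stop before the work has finished unless the device is synchronized first. A minimal timing sketch, where the helper name time_forward and the warm-up and repeat counts are hypothetical choices for illustration:

```python
import time
import torch
import torch.nn as nn

def time_forward(model, x, warmup=3, repeats=10):
    """Return the mean wall-clock seconds per forward pass (hypothetical helper)."""
    use_cuda = x.is_cuda
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs are not timed
            model(x)
        if use_cuda:
            torch.cuda.synchronize()     # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()     # wait for the timed kernels to finish
        elapsed = time.perf_counter() - start
    return elapsed / repeats

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 64, 32, 32, device=device)
for groups in (1, 64):
    conv = nn.Conv2d(64, 64, 3, padding=1, groups=groups).to(device)
    print(f"groups={groups}: {time_forward(conv, x):.6f}s")
```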

Note that in the above link you’re looking for any lines that say Group=1024, since that was the size of their input.

Doesn’t that link show an insignificant difference between groups=1 and groups=2, the same as what you’re seeing here? You can try groups=1024 and see if it’s faster.

I had planned on using depthwise convolutions on a future network, so was curious about this as well.

I forked that code and made the benchmark a little more extensive to cover different batch sizes, run with and without CUDA, and do multiple trials rather than just one.

Pytorch v1.0.0
GPU name GeForce GTX 745
-----CUDA True-----
--batch size = 1
resnet18        0.05661s
alexnet         0.01414s
vgg16           0.01525s
/usr/local/lib/python3.6/dist-packages/torchvision/models/ UserWarning: nn.init.kaiming_uniform is now deprecated in favor of nn.init.kaiming_uniform_.
/usr/local/lib/python3.6/dist-packages/torchvision/models/ UserWarning: nn.init.normal is now deprecated in favor of nn.init.normal_.
  init.normal(, mean=0.0, std=0.01)
squeezenet1_0   0.02958s
mobilenet       0.04543s
mobilenet one group 0.10287s
mobilenet four group 0.11901s
--batch size = 4
resnet18        0.03556s
alexnet         0.00807s
vgg16           0.01727s
squeezenet1_0   0.04847s
mobilenet       0.06989s
mobilenet one group 0.29338s
mobilenet four group 0.28810s
--batch size = 32
resnet18        0.18783s
alexnet         0.00866s
vgg16           0.02191s
squeezenet1_0   0.38219s
mobilenet       0.53036s
mobilenet one group 2.59376s
mobilenet four group 2.26333s
-----CUDA False-----
--batch size = 1
resnet18        1.14055s
alexnet         0.37858s
vgg16           2.17955s
squeezenet1_0   0.46393s
mobilenet       1.27138s
mobilenet one group 1.59579s
mobilenet four group 1.26217s
--batch size = 4
resnet18        2.22609s
alexnet         0.66891s
vgg16           6.26908s
squeezenet1_0   1.56296s
mobilenet       2.62577s
mobilenet one group 3.63102s
mobilenet four group 2.69251s
--batch size = 32
resnet18        12.45708s
alexnet         2.78703s
vgg16           45.08901s
squeezenet1_0   9.72596s
mobilenet       15.53032s
mobilenet one group 22.75712s
mobilenet four group 17.14482s

Note, this is on a pretty old GPU (GTX 745) and CUDA version (8.0). You might want to try running this as well and see what you get on your machine.

I definitely see an improvement when groups=input_channels compared to one group, but it is at best about 5x for larger batch sizes on GPU, and only about 1.5x on CPU. That speedup is certainly respectable, but I expected a larger one given the much greater reduction in params and ops that depthwise convolutions should give. I’m not sure if this is expected and matches other frameworks, or if this is a PyTorch issue.
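To put a rough number on the expected reduction, the multiply-accumulate count of one layer can be worked out by hand. The sketch below assumes a 3x3 kernel, 256 channels in and out, and a 32x32 output map (all arbitrary illustration values, not taken from the benchmark), and compares a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution as in MobileNet:

```python
def conv_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulates for one conv layer, ignoring bias."""
    return (c_in // groups) * c_out * k * k * h_out * w_out

c, k, h, w = 256, 3, 32, 32  # illustration values, not taken from the benchmark

standard = conv_macs(c, c, k, h, w)
depthwise = conv_macs(c, c, k, h, w, groups=c)
pointwise = conv_macs(c, c, 1, h, w)
separable = depthwise + pointwise

print(standard / depthwise)   # 256.0
print(standard / separable)   # about 8.7
```

Under these assumptions the depthwise layer alone needs 256x fewer multiply-accumulates, but the full separable block only about 8.7x fewer, because the 1x1 pointwise convolution dominates. Op counts also ignore memory traffic, so measured speedups need not match them.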

Note that you may be comparing different implementations. With

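import torch

# Same layer shape, once non-grouped and once with one group per channel (requires a CUDA device)
m1 = torch.nn.Conv1d(256, 256, 3, groups=1, bias=False).cuda()
m2 = torch.nn.Conv1d(256, 256, 3, groups=256, bias=False).cuda()
a = torch.randn(1, 256, 5, device='cuda')
b1 = m1(a)
b2 = m2(a)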

I get:

In: b1.grad_fn
Out: <SqueezeBackward1 at 0x7f0f35be90b8>
In: b2.grad_fn
Out: <SqueezeBackward1 at 0x7f0ed1ec2c50>
In: b2.grad_fn.next_functions
Out: ((<ThnnConvDepthwise2DBackward at 0x7f0ed1f007f0>, 0),)
In: b1.grad_fn.next_functions
Out: ((<CudnnConvolutionBackward at 0x7f0ed1ebf780>, 0),)

So you would be comparing the non-grouped cuDNN convolution with the “native” fallback TH(Cu)NN in the grouped case (which isn’t - or at least wasn’t - supported by cuDNN, so PyTorch needs to fall back to its own implementation). I didn’t look at the CUDA THNN implementation in great detail, but when I ported libtorch to Android, the CPU THNN convolution implementation involved unfold->matrix multiplication->fold and was hugely inefficient.
Of course, it would be highly desirable to have a more efficient native implementation, but it is quite a bit of work (e.g. for batch norm I managed to get the wall clock time on the GTX1080Ti close to CuDNN’s but that was a lot easier than I imagine convolutions to be - with things like using FFT sometimes and sometimes not etc.).
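The unfold->matrix multiplication pattern mentioned above can be sketched with the public torch.nn.functional.unfold API. This is only an illustration of the general technique with arbitrary tensor sizes, not the actual THNN code; for a single output map the final fold step reduces to a reshape:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 4, 8, 8)        # (batch, channels, height, width)
weight = torch.randn(6, 4, 3, 3)   # (out_channels, in_channels, kH, kW)

# unfold: copy every 3x3 patch into a column -> (batch, 4*3*3, 8*8)
cols = F.unfold(x, kernel_size=3, padding=1)

# matrix multiplication: (6, 36) @ (batch, 36, 64) -> (batch, 6, 64)
out = weight.view(6, -1) @ cols

# reshape the columns back into an image-shaped output
out = out.view(2, 6, 8, 8)

reference = F.conv2d(x, weight, padding=1)
print(torch.allclose(out, reference, atol=1e-4))  # True
```

In this example the unfolded buffer cols holds 36 values per output position where the input holds 4, so it takes 9x the memory of the input, which is one reason the approach can be costly.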

Best regards



Hi Tom,

What do you think would be a proper method to test the performance of depthwise_conv2d against a normal conv layer? I tried stacking 50 layers to test whether the depthwise version is faster than the normal one. Is that a fair and straightforward way to do it?

This GitHub issue says that even though depthwise conv has been implemented in cuDNN 7, on average it is no better than PyTorch’s implementation.

But I got no improvement in my own experiment, as mentioned above.
Is there something that needs to be configured (such as channels, kernel_size or backends) specifically to use depthwise_conv2d?

Thanks in advance

The two stacked 50-layer models are as follows:

Normal_conv2d:
layer1           : conv2d(3, 256, 3, padding=1, groups=1)
layer2 to layer49: conv2d(256, 256, 3, padding=1, groups=1)
layer50          : conv2d(256, 10, 3, padding=1, groups=1) # for crossentropy
                   conv2d(256, 3, 3, padding=1, groups=1) # for MSELoss

Separable_conv2d:
layer1           : conv2d(3, 256, 3, padding=1, groups=1)
layer2 to layer49: separable_conv2d(256, 256, 3)
layer50          : conv2d(256, 10, 3, padding=1, groups=1) # for crossentropy
                   conv2d(256, 3, 3, padding=1, groups=1) # for MSELoss

input and output:

random_input = torch.randn((1, 3, 256, 256))
random_output = torch.randint(low=0, high=10, size=(1,256,256)) # for crossentropy
random_output = torch.randn((1, 3, 256, 256)) # for MSELoss

Each model is trained on the GPU with CUDA 9.0, cuDNN 7 and PyTorch 1.0.1.post2.

Parameter counts and forward & backward time costs are as follows:
Loss              Optimizer  Model             Trainable parameters  Time cost
CrossEntropyLoss  Adam       Normal_conv2d     28354058              0.5144641399383545s
CrossEntropyLoss  Adam       Separable_conv2d  3311114               0.5536670684814453s
CrossEntropyLoss  SGD        Normal_conv2d     28354058              0.11238956451416016s
CrossEntropyLoss  SGD        Separable_conv2d  3311114               0.03952765464782715s
MSELoss           Adam       Normal_conv2d     28337923              0.5181684494018555s
MSELoss           Adam       Separable_conv2d  3294979               0.5568540096282959s
MSELoss           SGD        Normal_conv2d     28337923              0.17907309532165527s
MSELoss           SGD        Separable_conv2d  3294979               0.07207584381103516s

Note that:

  • separable_conv2d consists of a depthwise_conv2d followed by a pointwise_conv2d, as described in MobileNet.
  • The models have more parameters with the crossentropy loss because the last layer has more out_channels (10 instead of 3).
  • Training is fastest with the SGD optimizer and CrossEntropyLoss.
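For reference, a minimal sketch of what the separable_conv2d block and the two 50-layer stacks might look like. The class name SeparableConv2d and the helpers build_stack and count_params are hypothetical, and the kernel size of 3 in the last layer is an assumption; no activation or normalization layers are included because the post does not mention any. With these assumptions the parameter counts match the crossentropy numbers reported above:

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise conv followed by a pointwise 1x1 conv (hypothetical name)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch makes this a depthwise convolution
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def build_stack(separable, num_classes=10):
    """50 layers: one conv, 48 middle layers, one conv (layout assumed from the post)."""
    layers = [nn.Conv2d(3, 256, 3, padding=1)]
    for _ in range(48):
        if separable:
            layers.append(SeparableConv2d(256, 256, 3))
        else:
            layers.append(nn.Conv2d(256, 256, 3, padding=1))
    layers.append(nn.Conv2d(256, num_classes, 3, padding=1))
    return nn.Sequential(*layers)

def count_params(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_params(build_stack(separable=False)))  # 28354058
print(count_params(build_stack(separable=True)))   # 3311114
```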