Depthwise deformable convolutions are slow

Hi

I’m working on a project where I use deformable convolutions to shift features in the spatial dimensions. Since I only want to shift them spatially, it seemed logical to use depthwise convolutions, which, as far as I understand, can be done by setting groups = in_channels = out_channels.
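For reference, this is roughly the setup I have in mind (just a sketch, the shapes are only for illustration):

import torch
from torchvision.ops import DeformConv2d

C, H, W = 512, 32, 32  # channel count and spatial size only for illustration
x = torch.randn(1, C, H, W)

# depthwise deformable conv: groups = in_channels = out_channels
conv = DeformConv2d(C, C, kernel_size=3, padding=1, groups=C)

# offsets: 2 * 3 * 3 = 18 channels, one (x, y) offset per sampling location
offset = torch.zeros(1, 18, H, W)

out = conv(x, offset)  # (1, 512, 32, 32)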

Unfortunately, the deformable convolutions implemented in torchvision.ops.DeformConv2d, as well as the ones implemented in mmdetection, are very slow when using groups = in_channels (see time measurements below).

I’d like to avoid using a full convolution kernel (3x3xin_channels) because I don’t want to increase the capacity of my network too much. Is there a way to either a) speed up the deformable convolutions with the groups option, or b) use the same 3x3 kernel for each of the input channels?

Thanks in advance =)

Script used to measure times: https://gist.github.com/MauroPfister/6643169d1b0d9b656277dbaaa8548745

# Output of deform_conv_speed.py

bs: 01, in-channels: 512, groups: 1
Time normal conv:                0.35 ms
Time torchvision deform conv:    0.37 ms
Time mmdet deform conv:          0.38 ms
----------------------------------------
bs: 01, in-channels: 512, groups: 1
Time normal conv:                0.35 ms
Time torchvision deform conv:    0.34 ms
Time mmdet deform conv:          0.31 ms
----------------------------------------
bs: 16, in-channels: 512, groups: 512
Time normal conv:                0.31 ms
Time torchvision deform conv:   15.46 ms
Time mmdet deform conv:         14.05 ms
----------------------------------------
bs: 16, in-channels: 512, groups: 512
Time normal conv:                0.35 ms
Time torchvision deform conv:   15.46 ms
Time mmdet deform conv:         14.06 ms
----------------------------------------

You have a typo in your profiling script and are using the mmdetection implementation twice instead of the torchvision one. :wink:
That being said, you can see the torchvision CUDA implementation here, which uses an im2col approach and applies bilinear interpolation.

Note that “normal” convs are heavily accelerated for “typical” shapes in cudnn and an im2col approach would not yield the same performance.
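For a plain conv (ignoring the bilinear sampling that the deformable kernel does while gathering the patches), the im2col idea looks roughly like this:

import torch
import torch.nn.functional as F

x = torch.randn(1, 512, 32, 32)
weight = torch.randn(512, 512, 3, 3)

# im2col: gather all 3x3 patches into columns, so the conv becomes a matmul
cols = F.unfold(x, kernel_size=3, padding=1)   # (1, 512*9, 32*32)
out = weight.reshape(512, -1) @ cols           # (1, 512, 32*32)
out = out.reshape(1, 512, 32, 32)

# matches the direct conv up to floating point error
ref = F.conv2d(x, weight, padding=1)
print((out - ref).abs().max())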
The paper mentions

Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision.

i.e. the offsets themselves are learned, so selecting a fast, specialized kernel for a specific workload doesn’t seem to work out of the box. :confused:


Thanks @ptrblck for having a look at my script and spotting the typo! I updated both the profiling output and the script.

I already feared that it would not be straightforward to optimize the grouped conv case. A possible alternative for me would be applying the same 3x3 kernel to all channels. I read that something like this can be done by merging the batch and channel dimensions, performing the conv, and then splitting them again (roughly as in the sketch below). However, in my case I have different offsets for each image in a batch, so I cannot perform the exact same convolution over all channels in a batch (even though the kernel would be the same).
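For a regular conv with one shared kernel, what I read about would look something like this (just a sketch with made-up shapes):

import torch
import torch.nn.functional as F

N, C, H, W = 16, 512, 32, 32
x = torch.randn(N, C, H, W)

# one shared 3x3 kernel for every channel of every image
kernel = torch.randn(1, 1, 3, 3)

# merge batch and channel dims, run a single-channel conv, split again
out = F.conv2d(x.reshape(N * C, 1, H, W), kernel, padding=1)
out = out.reshape(N, C, H, W)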

Do you have any idea how to solve this?

Yes, the dimension wrapping would work.
Could you explain what exactly the offsets are doing, as I’m not deeply familiar with the implementation?

The offsets determine the sampling locations of the kernel at each point in the output map. This article explains it very well (especially the first image). For example, a 3x3 deformable convolution on an (h, w) input has an “offset map” of shape (18, h, w): 18 because there are 9 sampling locations, each with an (x, y) offset.

These offset maps are calculated with another small CNN based on the input, so they are different for every image in the batch.
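In code, the offset prediction looks roughly like this (channel counts only for illustration):

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

C, H, W = 512, 32, 32
x = torch.randn(4, C, H, W)

# small conv that predicts the offsets from the input itself,
# so every image in the batch gets its own offset map
offset_conv = nn.Conv2d(C, 2 * 3 * 3, kernel_size=3, padding=1)
deform_conv = DeformConv2d(C, C, kernel_size=3, padding=1)

offset = offset_conv(x)       # (4, 18, 32, 32)
out = deform_conv(x, offset)  # (4, 512, 32, 32)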

I hope this makes things a bit more clear =)

I think I found a rather ugly hack to increase the performance of my network.

If I understood the concept correctly, a grouped (depthwise) convolution is essentially a normal convolution where only the i-th channel of the i-th kernel contains non-zero weights. That means I can train my network using grouped convolutions (groups = in_channels), and during inference use full convolutions (groups = 1), but initialize the kernels with 0 and insert the previously trained kernels at the i-th channel of the i-th kernel. I think the code snippet below illustrates better what I’m doing.

# model.deform_conv.weight.shape = [512, 512, 3, 3]
# pretrained_kernels.shape = [512, 1, 3, 3]

with torch.no_grad():
    model.deform_conv.weight.zero_()  # all cross-channel weights stay zero
    for i in range(model.deform_conv.weight.shape[0]):
        # place the trained depthwise kernel at channel i of kernel i
        model.deform_conv.weight[i, i, :, :] = pretrained_kernels[i, 0, :, :]

I modified my code accordingly and can confirm that the results are exactly the same as with grouped convolutions, but the inference time is decreased by a factor of 8.

Now I’m wondering whether this special case of grouped convolutions is really that hard to implement. Couldn’t the “hack” that I explained above also be implemented at the CUDA level?

@ptrblck Maybe you can tell me if this is feasible or not? Thanks a lot!


That’s a neat trick! :slight_smile:
Are you recreating these “zero-padded” convolutions in each iteration or are you zeroing out the gradients?
Did you also verify that you get the same gradients in both approaches?

For the moment I use this approach only during inference. That is, I train with grouped convolutions (which does not seem to be much slower) and modify the network afterwards for faster inference.

I did not think about using the same approach during training, but it is a great idea! Do you think it would be sufficient to initialize these “zero-padded” kernels once at the beginning, because all the zero weights would generate a zero gradient?
If that does not work, at which point would I have to zero out the gradients? Right before taking the optimizer step?
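In case it helps, this is roughly what I imagine for the gradient zeroing (just a sketch, with a plain Conv2d standing in for the deformable conv):

import torch
import torch.nn as nn

conv = nn.Conv2d(512, 512, kernel_size=3, padding=1, bias=False)

# mask that keeps only the "depthwise diagonal": channel i of kernel i
mask = torch.zeros_like(conv.weight)
idx = torch.arange(conv.weight.shape[0])
mask[idx, idx] = 1.0

with torch.no_grad():
    conv.weight.mul_(mask)  # zero out all cross-channel weights

# zero the cross-channel gradients on every backward pass
conv.weight.register_hook(lambda grad: grad * mask)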

Just for the sake of completeness: The profiling times in my first post are not correct, but I cannot edit the post anymore. Here are the correct ones:

# Output of deform_conv_speed.py

bs: 01, in-channels: 512, groups: 1
Time normal conv:                0.37 ms
Time torchvision deform conv:    0.37 ms
Time mmdet deform conv:          0.36 ms
----------------------------------------
bs: 01, in-channels: 512, groups: 512
Time normal conv:                0.19 ms
Time torchvision deform conv:   15.23 ms
Time mmdet deform conv:         13.67 ms
----------------------------------------
bs: 16, in-channels: 512, groups: 1
Time normal conv:                1.15 ms
Time torchvision deform conv:    2.26 ms
Time mmdet deform conv:          2.19 ms
----------------------------------------
bs: 16, in-channels: 512, groups: 512
Time normal conv:                0.32 ms
Time torchvision deform conv:   15.06 ms
Time mmdet deform conv:         13.76 ms
----------------------------------------

Are you recreating these “zero-padded” convolutions in each iteration or are you zeroing out the gradients?
Did you also verify that you get the same gradients in both approaches?

I tried implementing @ptrblck’s idea and checking it on a toy example. Currently I’m using torch.equal() to compare the gradients. They do not match in 100% of the cases, but they almost always do. I’m not sure if this is due to numerical errors or if there is something wrong in my script.
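For reference, a tolerance-based comparison should at least separate pure floating point noise from a real mismatch (a minimal sketch with placeholder tensors):

import torch

grad_grouped = torch.randn(512, 1, 3, 3)  # placeholder for the real gradients
grad_padded = grad_grouped + 1e-8 * torch.randn_like(grad_grouped)

# exact comparison is sensitive to tiny floating point differences
print(torch.equal(grad_grouped, grad_padded))  # False

# tolerance-based comparison tolerates small numerical differences
print(torch.allclose(grad_grouped, grad_padded, rtol=1e-5, atol=1e-6))  # True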