Performing convolutions in groups (but not grouped convolution!)

I have a tensor with a batch of multi-channel 2D image patches and I want to efficiently apply the same convolutional filters to each image patch. How can I do this?

Let’s say:

B = # batches
N = # patches per batch
C = # input channels per patch
D = # output channels per patch
H = height of patch
W = width of patch

My input tensor x is B x N x C x H x W.
Given a convolution function conv(x), I want the output to be B x N x D x H x W.

Typically, these patches are spatially fairly small, on the order of 8x8 pixels or so. However, N can be like 20,000+.

I have tried collapsing the first two dimensions to a BN x C x H x W tensor, but it appears that PyTorch’s convolution implementation is designed such that the convolution operation is parallelized per batch within a for-loop. Thus, I’m just looping over BN batches sequentially. This is quite slow (I’ve tested it) and doesn’t really take advantage of a GPU at all as it’s largely sequential.
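A minimal sketch of that approach (the sizes and the conv layer here are just placeholders for my actual network):

import torch
import torch.nn as nn

B, N, C, D, H, W = 4, 2048, 32, 64, 8, 8          # placeholder sizes
x = torch.randn(B, N, C, H, W, device="cuda")
conv = nn.Conv2d(C, D, kernel_size=3, padding=1).cuda()

# Collapse the batch and patch dims into one big batch, convolve, restore.
out = conv(x.view(B * N, C, H, W)).view(B, N, D, H, W)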

The other thing I looked at was a grouped convolution. I collapsed dimensions 1 and 2 so I had a B x NC x H x W tensor, and then set groups=N. However, this results in a B x D x H x W output, whereas I would be looking for a B x ND x H x W output. If I explicitly provide that multiplier to the number of channels (e.g. out_channels=N*1024 or something), however, PyTorch tries to initialize a massive weight matrix for the convolution, which is certainly unnecessary and obviously exceeds the capabilities of my device.

Finally, I considered tiling the patches so that the tensor was B x C x NH x W (or some variation thereof), but this would now require some funky striding or padding to make sure I wasn’t sampling between patches.

In summary, I am looking for an efficient operation that will apply a D x C x K x K weight matrix to my B x N x C x H x W input tensor and produce a B x N x D x H x W output tensor. That is, a convolution in groups, but not a grouped convolution. Is this possible using the existing tools in PyTorch? Everything is so nicely aligned in memory that it seems like it ought to be. I am open to any necessary permutation of dimensions.

Hi,

Doing the BN x C x H x W is the right thing to do here I think.
Why do you think it runs over batch sequentially? Which device are you using to run this (CPU or GPU)?

Hmm. Interesting. The code base has definitely changed a bit since I last dived into the convolution implementations about a year ago, but this THCUNN implementation appears to use a for-loop. I am using the GPU backend.

I guess I’m also just surprised by how poorly this scales if it’s not sequential. When testing such that the actual spatial convolution is fixed (i.e. C x H x W doesn’t change), my network with N=1280 processes a batch in ~0.4 seconds on average. When N=20480, it takes almost 5 seconds. That’s linear scaling, which seemed to suggest that these batches are being processed sequentially.

That’s linear scaling, which seemed to suggest that these batches are being processed sequentially.

Or that it is already compute bound at N=1280 :wink:
If you’re using the GPU backend in particular, check nvidia-smi to see what the utilisation is.

Also you are most likely not using the THCUNN code but cudnn.
If you have especially unusual sizes (compared to classical resnet-like CNN), try setting torch.backends.cudnn.benchmark = True. This will allow cudnn to test more algorithms (so the first time it sees a new size it will be slower, but subsequent runs should be faster).
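For example (just the flag, set once before the first forward pass):

import torch

# Let cudnn autotune: the first pass for each new input shape tries several
# algorithms (slower), later passes reuse the fastest one it found.
torch.backends.cudnn.benchmark = True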

I can’t imagine we’re at the compute bound though. I’d thought of that, but H=W=4 in those tests and C scales by powers of 2. Effectively, I am running this with data equivalent to input image sizes of 3x256x512 in batches of 4 on each GPU. GPU utilization is at 100% though… I honestly don’t understand why this should throttle stuff so badly. I’m using a GTX 1080 Ti and I’ve historically had better performance than this with larger image sizes.

I will try cudnn benchmark. Maybe that could help.

I benchmarked against the same number of elements but arranged as a B x C x H x W tensor with N evenly distributed among H x W and it takes about 0.3 seconds per batch with N=20480. I don’t think it’s a compute issue. Even if batches are parallelized, I think it must be to a lesser extent (or have inherently less gain from parallelization) than the spatial convolutions.

Hi,

You might be bound by the cuda kernel launches as well if each operation is very small and you don’t use cudnn.
If you use cudnn, the benchmark mode should help.

The benchmark brought the batch time down to ~3.5s on average, so it definitely helped (thanks!), but I think there’s gotta be more to the story.

Do you have a small code sample I can use to experiment locally? Because I’m not sure what the 3.5s means here. (Input size / batch size / cuda,cudnn version / hardware etc)

One common cuda perf pitfall is when we end up launching many many kernels which do very small tasks. So you spend more time launching the kernel than doing compute.

  • This can be because of the internal implementation (THCUNN or cudnn) that makes some assumptions (small batch size here) that don’t match your use case.
  • This can be because your code does these ops and it should be batched differently.

For the first point, there is no magic kernel to perform all convolutions, so many specialized kernels are needed to get the best perf. It’s possible that such a specialized kernel is not in cudnn :confused:

Let me see if I can produce a minimal example, and I’ll get back to you. Some of this is intertwined with other stuff I’m working on. I’ll try to throw together a mini-profiler.

In theory, this should lend itself nicely to the im2col convolution variant at the very least. Instead of parallelizing the copy over C x H x W, it would parallelize it over N x C x H x W (as NHW in this case is roughly the same as HW in a normal spatial convolution over larger images) and have some extra checks in the double for-loop over the filter data to make sure that it’s not copying from the wrong “patch”.

So I went and rolled my own “Patch Convolution,” reimplementing im2col and col2im with a 5-D tensor (B x C x P x H x W) and launching it with C * P * H * W threads. It’s definitely slower than cudnn would be (my CUDA skills are naive at best), but it runs the same network as described above in ~1.6s per batch. I know it’s not really possible to demonstrate the difference without that profiler I promised, but this result suggests that the “optimal kernel” for my problem just might not be in the cudnn library, as you posited.

In the meantime, I can live with my 5x slowdown for now…

On another note, this operation is totally something that PyTorch should support though. It’s a fairly trivial change from the standard im2col for conv2D. The only major differences are in how we stride and index. I’m happy to share my baseline implementation, but I am sure the PyTorch devs can do a better job than me.

I’m a bit confused about the numbers.

Did you make sure to synchronize the device before starting and stopping the timer (or did you profile the kernels via nsight)?
For cudnn benchmark, some warm-up iterations are necessary in addition as @albanD explained.
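E.g. a minimal timing pattern along these lines (time_gpu is just an illustrative helper, not a PyTorch function):

import time
import torch

def time_gpu(fn, warmup=5, iters=20):
    # Warm-up iterations also let cudnn.benchmark settle on an algorithm.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()   # wait for all pending kernels before starting
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()   # make sure the timed work has actually finished
    return (time.time() - start) / iters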

What is your use case by the way?
If you want to apply some operation on each small patch, you might want to unfold the image, and perform the op e.g. via matmul.

I used torch.cuda.synchronize() to profile. My use case is simply convolving on a set of many many patches. unfold might do the trick actually. I’ve been looking through the docs, and correct me if I’m wrong, but isn’t this basically creating the “column” matrix from the im2col operation?

Once I get some stuff going, I will try to put together a proper profile still.

Yes, unfold can be used to apply the im2col operation, which might be cheaper for your use case than a lot of small convolutions.

Yeah, that’s basically what I implemented before, which gave me the best speed-up so far, except I went and wrote my own CUDA kernels. Something tells me unfold and fold are probably much more efficient than my im2col and col2im… ;-). I’ll throw this in and see what kind of speed it brings. Do unfold and fold leverage cudnn? That is, will torch.backends.cudnn.benchmark = True add anything?

Hi,

No, cudnn does not provide such functions.
But unfold and fold are mostly playing with sizes and strides, not doing much on the GPU really, so you don’t need custom CUDA kernels for them, right?
Then you only need a matmul which is already implemented.

Does something like this work for you?

import torch

inp = torch.randn(1, 3, 10, 12)
w = torch.randn(2, 3, 4, 5)

# im2col: (1, 3*4*5, 7*8) column matrix of all 4x5 sliding windows
inp_unf = torch.nn.functional.unfold(inp, (4, 5))
# one matmul with the flattened weight does all the dot products at once
out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
# col2im with a 1x1 kernel just reshapes back to (1, 2, 7, 8)
out = torch.nn.functional.fold(out_unf, (7, 8), (1, 1))
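Extending the same idea to your patch layout could look roughly like this (sizes, padding and names are only an example, not tuned):

import torch
import torch.nn.functional as F

B, N, C, D, H, W, K = 2, 1024, 32, 64, 8, 8, 3     # example sizes only
x = torch.randn(B, N, C, H, W)
w = torch.randn(D, C, K, K)

# im2col over every patch at once: (B*N, C*K*K, H*W) with kernel 3, padding 1
cols = F.unfold(x.view(B * N, C, H, W), K, padding=1)
# a single batched matmul replaces B*N tiny convolutions: (B*N, D, H*W)
out = w.view(D, -1).matmul(cols).view(B, N, D, H, W)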

Alright, I put together a small profiler comparison between the different methods we talked about. It’s also totally possible my profiler is not timing things accurately. Please feel free to correct me if I’m doing something incorrectly.

The profiler compares:
(1) torch.nn.Conv2d on a BP x C x H x W tensor
(2) My custom patch_im2col CUDA kernel on a B x C x P x H x W tensor
(3) torch.nn.functional.unfold + torch.matmul on a B x P x C x H x W tensor

I made a GitHub repo so you can actually run it locally or inspect the code, if you’re interested in trying that.

The forward pass results averaged over 20 iterations on a single NVIDIA GTX 1080 Ti:

Parameters:
B = 1
P = 20480
H = 4
W = 4
in_channels = 32
out_channels = 64
kernel_size = 3
padding = 1
stride = 1
dilation = 1

Various patch convolution methods:
  Using-unfold avg. time: 0.058497 seconds
  Rolled-into-batch avg. time: 0.089629 seconds
  Custom-patch-im2col avg. time: 0.036741 seconds


B = 1
in_channels = 32
H = 600
W = 600
Compare to traditional convolution with B x C x H x W inputs with similar number (actually more) of elements:
  nn.Conv2d: 0.037487 seconds

This test doesn’t show an obvious difference between the methods, so maybe it’s the backward pass that separates them. I am not 100% sure how to profile that though. Perhaps either of you could shed some light? @albanD @ptrblck

Hi,

Thanks for the code.
I quickly tried it.

  • Your imports in your custom versions import modules from another package (old install I guess). They should be import _patchconv_ext._patch_convolution as patch_conv (and same for transposed).
    Running with this I get a cuda kernel failed for your code though :confused: Do you see the same thing?
  • What are the timings you get with the default settings? So that I can compare with what I get.

I slightly modified your test script (and removed your custom code that does not run here).
You can find in this gist https://gist.github.com/albanD/e57425691686b708b3d57329d633ad33 the results I have locally.
The conclusions I have:

  • cudnn benchmark does give the expected results for the forward pass (close to the conv with similar number of ops)
  • For the backward, not sure why the unfold version is so good and the others are not. I expect the expands lead to a very efficient backward?

I tested on CUDA 10.0 with cudnn 76.02 on a Quadro GP100
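Regarding profiling the backward, the autograd profiler might be a starting point (rough sketch; the layer and sizes below are only placeholders for your real workload):

import torch
import torch.nn as nn

conv = nn.Conv2d(32, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(20480, 32, 4, 4, device="cuda")

# Profile forward + backward together and sort by CUDA time.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = conv(x)
    out.sum().backward()
print(prof.key_averages().table(sort_by="cuda_time_total"))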

So I’m still not quite sure how to handle kernel fails and properly print the error. I thought what I had worked, but I’ve run into issues where it just says “kernel failed: no error”. Probably an out of memory issue? The GPU I tested on has ~12GB of RAM. I also have CUDA_NUM_THREADS=512. Maybe that value needs to change for your setup? It’s all hardcoded, unfortunately.

Also, I fixed those imports shortly after posting–this was all tied up in a different module at one point. It should work now.

In any case, I ran your better profiler script locally and un-commented the custom code, which I ran as well. Here are the results on a GTX 1080 Ti with CUDA 10.1 (still looking for the cudnn version) and PyTorch 1.2

######################### CUDNN FALSE


Various patch convolution methods:
  Using-unfold avg. time: 0.054104 seconds
  Rolled-into-batch avg. time: 0.996649 seconds


Compare to traditional convolution with B x C x H x W inputs with similar number (actually more) of elements:
  nn.Conv2d: 0.041167 seconds
######################### CUDNN TRUE


Various patch convolution methods:
  Using-unfold avg. time: 0.044344 seconds
  Rolled-into-batch avg. time: 0.072312 seconds
  Custom-patch-im2col avg. time: 0.037970 seconds


Compare to traditional convolution with B x C x H x W inputs with similar number (actually more) of elements:
  nn.Conv2d: 0.038860 seconds
######################### CUDNN TRUE BENCH


Various patch convolution methods:
  Using-unfold avg. time: 0.044354 seconds
  Rolled-into-batch avg. time: 0.040456 seconds
  Custom-patch-im2col avg. time: 0.038056 seconds


Compare to traditional convolution with B x C x H x W inputs with similar number (actually more) of elements:
  nn.Conv2d: 0.044387 seconds
######################### CUDNN TRUE BENCH


Various patch convolution methods:
  Using-unfold avg. time: 0.044427 seconds
  Rolled-into-batch avg. time: 0.035127 seconds
  Custom-patch-im2col avg. time: 0.037969 seconds


Compare to traditional convolution with B x C x H x W inputs with similar number (actually more) of elements:
  nn.Conv2d: 0.038883 seconds

It’s also possible that the unfold approach is taking advantage of not re-allocating memory each time? My method treats each layer independently and forward and backward passes re-allocate intermediate storage. Perhaps there’s some optimization going on in the PyTorch backend to not re-allocate and destroy GPU memory so much. Perhaps the intermediate storage is fixed through a full network pass somehow? I’m speculating now, but it might explain the performance gains.

Yes a deeper investigation would be needed for the unfold :smiley:

Does this mean that the cudnn perf with benchmark = True matches your expectations?