For my project I am using PyTorch as a linear algebra backend. In the performance-critical part of my code I need to compute the 1D convolution of two small vectors (1D tensors of length between 2 and 9) a very large number of times. My code allows for batch-processing of inputs: I can stack a couple of input vectors into matrices and convolve them all at the same time. Since `torch.conv1d` does not allow convolving along a single dimension for 2D inputs, I had to write my own convolution function, called `convolve` (thank you @KFrank: Torch conv1d for two 2D matrices). This new function, however, consists of a double for-loop and is therefore very slow.
Question: how can I make the `convolve` function perform faster through better code design, while still letting it deal with batched inputs (i.e. 2D tensors)?

Partial answer: somehow avoid the double for-loop.
Below are three Jupyter notebook cells that recreate a minimal example. Note that you need `line_profiler` and the `%%writefile` magic command to make this work!
```python
%%writefile SO_CONVOLVE_QUESTION.py
import torch

def conv1d(a, v):
    padding = v.shape[-1] - 1
    return torch.conv1d(
        input=a.view(1, 1, -1), weight=v.flip(0).view(1, 1, -1),
        padding=padding, stride=1
    ).squeeze()

def convolve(a, v):
    if a.ndim == 1:
        a = a.view(1, -1)
        v = v.view(1, -1)

    nrows, vcols = v.shape
    acols = a.shape[1]

    expanded = a.view((nrows, acols, 1)) * v.view((nrows, 1, vcols))
    noutdim = vcols + acols - 1
    out = torch.zeros((nrows, noutdim))
    for i in range(acols):
        for j in range(vcols):
            out[:, i + j] += expanded[:, i, j]
    return out.squeeze()

x = torch.randn(5)
y = torch.randn(7)
```
I write the code to `SO_CONVOLVE_QUESTION.py` because that is necessary for `line_profiler` and to use as a setup for `timeit.timeit`.

Now we can evaluate the output and performance of the code above on non-batched input (`x`, `y`) and batched input (`x_batch`, `y_batch`):
```python
from SO_CONVOLVE_QUESTION import *

# Without batch processing
res1 = conv1d(x, y)
res = convolve(x, y)
print(torch.allclose(res1, res))   # True

# With batch processing, NB first dimension!
x_batch = torch.randn(5, 5)
y_batch = torch.randn(5, 7)

results = []
for i in range(5):
    results.append(conv1d(x_batch[i, :], y_batch[i, :]))
res1 = torch.stack(results)
res = convolve(x_batch, y_batch)
print(torch.allclose(res1, res))   # True

import timeit
# setup for timeit: import the functions and inputs from the file written above
setup = 'from SO_CONVOLVE_QUESTION import conv1d, convolve, x, y'
print(timeit.timeit('convolve(x, y)', setup=setup, number=10000))
# 4.83391789999996
print(timeit.timeit('conv1d(x, y)', setup=setup, number=10000))
# 0.2799923000000035
```
In the block above you can see that performing the convolution 5 times using the `conv1d` function produces the same result as `convolve` on batched inputs. We can also see that `convolve` (4.8 s for 10000 runs) is much slower than `conv1d` (0.28 s). Below we assess the slow part of the `convolve` function WITHOUT batch processing, using `line_profiler`:
```python
%load_ext line_profiler
%lprun -f convolve convolve(x, y)  # evaluated without batch-processing!
```
```
Timer unit: 1e-07 s

Total time: 0.0010383 s
File: C:\python_projects\pysumo\SO_CONVOLVE_QUESTION.py
Function: convolve at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def convolve(a, v):
    10         1         68.0     68.0      0.7      if a.ndim == 1:
    11         1        271.0    271.0      2.6          a = a.view(1, -1)
    12         1         44.0     44.0      0.4          v = v.view(1, -1)
    13
    14         1         28.0     28.0      0.3      nrows, vcols = v.shape
    15         1         12.0     12.0      0.1      acols = a.shape[1]
    16
    17         1       4337.0   4337.0     41.8      expanded = a.view((nrows, acols, 1)) * v.view((nrows, 1, vcols))
    18         1         12.0     12.0      0.1      noutdim = vcols + acols - 1
    19         1        127.0    127.0      1.2      out = torch.zeros((nrows, noutdim))
    20         6         32.0      5.3      0.3      for i in range(acols):
    21        40        209.0      5.2      2.0          for j in range(vcols):
    22        35       5194.0    148.4     50.0              out[:, i+j] += expanded[:, i, j]
    23         1         49.0     49.0      0.5      return out.squeeze()
```
Obviously the double for-loop and the line creating the `expanded` tensor are the slowest (together they account for roughly 94% of the total time). Are these parts avoidable with better code design?
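Two untested ideas I can sketch (the function names below are placeholders of my own and have not been verified for correctness or speed against the `conv1d` baseline above).

First, the double loop could perhaps be reduced to a single loop over the (short) kernel dimension, adding whole shifted slices of `a` at once instead of one scalar column per iteration:

```python
import torch

def convolve_single_loop(a, v):
    # Sketch: one loop over the kernel taps instead of a double loop.
    # For tap j, every output column i + j receives a[:, i] * v[:, j],
    # i.e. a whole shifted slice of `a` scaled by one kernel column.
    if a.ndim == 1:
        a = a.view(1, -1)
        v = v.view(1, -1)
    nrows, vcols = v.shape
    acols = a.shape[1]
    out = torch.zeros((nrows, acols + vcols - 1))
    for j in range(vcols):
        out[:, j:j + acols] += a * v[:, j:j + 1]  # broadcast tap j over all rows
    return out.squeeze()
```

Second, a full vectorization might be possible with `torch.conv1d` itself via its `groups` argument: with one group per row, kernel i is applied only to row i, which is exactly the row-wise pairing `convolve` implements:

```python
import torch

def convolve_grouped(a, v):
    # Sketch: rows become input channels; with groups=nrows each group
    # holds exactly one row, so no cross-row mixing occurs.
    if a.ndim == 1:
        a = a.view(1, -1)
        v = v.view(1, -1)
    nrows, vcols = v.shape
    return torch.conv1d(
        input=a.unsqueeze(0),            # shape (1, nrows, acols)
        weight=v.flip(-1).unsqueeze(1),  # shape (nrows, 1, vcols); flip for true convolution
        padding=vcols - 1,
        groups=nrows,
    ).squeeze()
```

If either of these is on the right track, it should be checkable with the same `torch.allclose` comparisons as in the benchmark cell above.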