Speed up fully-separable 1D convolution?

Suppose I have a 1D convolutional layer with 2 input channels, 32 output channels, and kernels of length 9. My weight tensor has a very special structure: it can be expressed as an “outer product” of three tensors, one per axis, as shown below, where I generate a dummy weight tensor and some dummy data of this form and compute the convolution using conv1d:

import torch
import torch.nn.functional as F

in_channels = 2
out_channels = 32
kernel_size = 9
nsamples = 2**12
batch_size = 1
padding = kernel_size//2
x = torch.randn(batch_size, in_channels, nsamples)

outspace = torch.randn(out_channels,1,1)
inspace = torch.randn(1,in_channels,1)
kernelspace = torch.randn(1,1,kernel_size)
w = outspace*inspace*kernelspace
y = F.conv1d(x,w,padding=padding)
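For reference, the rank-1 structure of w makes the convolution itself separable. Writing p for the padding offset and indexing batch, output channel, input channel, kernel tap, and time by b, o, i, k, t:

```latex
y_{b,o,t} = \sum_{i,k} w_{o,i,k}\, x_{b,i,\,t+k-p}
          = \mathrm{outspace}_o \sum_{k} \mathrm{kernelspace}_k
            \Big( \sum_{i} \mathrm{inspace}_i\, x_{b,i,\,t+k-p} \Big)
```

So the per-channel sums can be pulled apart and evaluated one factor at a time.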

The space of all weight tensors with the same shape as w requires in_channels*out_channels*kernel_size parameters to describe, but the restricted space of tensors that factor as above only requires in_channels+out_channels+kernel_size parameters. Correspondingly, there should be a sequence of simpler operations that computes F.conv1d(x,w) as above with far fewer multiplies when w has this special structure. Something like

y = op1(kernelspace,x)
y = op2(inspace, y)
y = op3(outspace, y)

for some op1, op2, and op3. I can write out the math for how to do it, but I’m wondering what the fast way to do this in PyTorch is. I imagine grouped convolution and clever use of matmul would do it, but I’m not familiar enough with the API to formulate it.
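For what it’s worth, here is one way the three ops can be written with standard calls (a sketch under the shapes above, not necessarily the fastest formulation): collapse the input channels with a broadcast multiply-and-sum, run one single-channel convolution with the shared kernel, then apply a per-output-channel scale. The multiply count per output sample drops from roughly out_channels*in_channels*kernel_size to in_channels + kernel_size + out_channels.

```python
import torch
import torch.nn.functional as F

in_channels, out_channels, kernel_size = 2, 32, 9
nsamples, batch_size = 2**12, 1
padding = kernel_size // 2

x = torch.randn(batch_size, in_channels, nsamples)
outspace = torch.randn(out_channels, 1, 1)
inspace = torch.randn(1, in_channels, 1)
kernelspace = torch.randn(1, 1, kernel_size)
w = outspace * inspace * kernelspace

# Reference: full convolution with the rank-1 weight tensor.
y_ref = F.conv1d(x, w, padding=padding)

# op1: collapse the input channels with a broadcast multiply-and-sum
# (equivalently a 1x1 conv with weight inspace) -> (batch, 1, nsamples)
z = (x * inspace).sum(dim=1, keepdim=True)

# op2: a single single-channel convolution with the shared kernel;
# kernelspace already has conv1d's (out=1, in=1, k) weight layout
z = F.conv1d(z, kernelspace, padding=padding)

# op3: per-output-channel scale via broadcasting -> (batch, out_channels, nsamples)
y = outspace.view(1, out_channels, 1) * z

# the two paths agree up to float32 rounding
assert torch.allclose(y, y_ref, atol=1e-4)
```

The intermediate z never has more than one channel, so the only kernel-length convolution runs over a single channel regardless of in_channels and out_channels.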