torch.nn.functional.conv1d implementation

I am writing a custom operation that uses torch.nn.functional.conv1d heavily.

I have two questions.

1. It seems that torch.nn.functional.conv1d is very slow. I would expect it to be implemented via the fast Fourier transform and thus be fast, but convolving two vectors of the same length (I padded one periodically before feeding it into the function) seems to be slower than multiplying a matrix by a vector. Is there a way to speed this up?

2. It seems that torch.nn.functional.conv1d does not support the GPU. I get the following error message:
“Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #4 ‘other’”
Also, torch.nn.functional.conv1d does not seem to support float64, which makes gradient checking difficult.
Is there a way to get around these issues, at least to get my operation running on the GPU?

Any suggestions are appreciated.

  1. The algorithm depends on the backend you are using. The FFT approach might not be the fastest in all cases.
    Try setting torch.backends.cudnn.benchmark = True at the beginning of your script. This should choose the fastest algorithm for your input size. Note that this might slow down your code if your input size changes a lot.
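For example, enabling the autotuner at the top of the script (a minimal sketch):

import torch

# Let cuDNN benchmark the available convolution algorithms for the
# current input shape and cache the fastest one. If the input shape
# keeps changing, the benchmarking is redone and can hurt performance.
torch.backends.cudnn.benchmark = True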

  2. It seems your weight or input is not on the GPU. This code snippet works on the GPU:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 24).to('cuda')
weight = torch.randn(6, 3, 3).to('cuda')

output = F.conv1d(x, weight)
print(output.device)
> cuda:0

Also, float64 works:

x = torch.randn(1, 3, 24, dtype=torch.double).to('cuda')
weight = torch.randn(6, 3, 3, dtype=torch.double).to('cuda')

output = F.conv1d(x, weight)
print(output.dtype)
> torch.float64
print(output.device)
> cuda:0
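
Regarding the gradient check: torch.autograd.gradcheck expects double precision inputs, so the float64 support shown above is exactly what you need. A minimal sketch:

import torch
import torch.nn.functional as F
from torch.autograd import gradcheck

# gradcheck numerically verifies the analytical gradients and
# requires float64 inputs with requires_grad=True.
x = torch.randn(1, 3, 24, dtype=torch.double, device='cuda', requires_grad=True)
weight = torch.randn(6, 3, 3, dtype=torch.double, device='cuda', requires_grad=True)

print(gradcheck(F.conv1d, (x, weight)))
> True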

Many thanks!

I figured out the problem in my code. I created some zero tensors for the padding, but I forgot to create them on the GPU.
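For future readers, a minimal sketch of the fix (the shapes here are made up for illustration):

import torch

x = torch.randn(1, 3, 24, device='cuda')

# Wrong: torch.zeros defaults to CPU, which triggers the
# FloatTensor vs. cuda.FloatTensor mismatch above.
# pad = torch.zeros(1, 3, 4)

# Right: create the padding on the same device (and dtype) as the input.
pad = torch.zeros(1, 3, 4, device=x.device, dtype=x.dtype)
x_padded = torch.cat([x, pad], dim=2)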

I will test the benchmark trick later.

Just to add some remarks on the benchmark trick here, for the convenience of future readers.

In my program I cannot see a performance increase from setting torch.backends.cudnn.benchmark = True, even though the dimensions of my input are always the same. I will need to investigate this later. Still, running on the GPU is a lot faster than running on the CPU (GPU: MX150 with 2 GB memory; CPU: 8th-gen i5 with 8 GB RAM).
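Also, since my original question mentioned the FFT: for a circular convolution of two same-length vectors (which is what my periodic padding emulates), the result can be computed directly via the convolution theorem. A hypothetical sketch using the torch.fft module (available in newer PyTorch versions; note that conv1d actually computes a cross-correlation, so the flipping convention differs):

import torch

n = 1024
a = torch.randn(n, device='cuda')
b = torch.randn(n, device='cuda')

# Circular convolution via the convolution theorem:
# conv = irfft(rfft(a) * rfft(b))
conv = torch.fft.irfft(torch.fft.rfft(a) * torch.fft.rfft(b), n=n)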

Many thanks to ptrblck for answering my question.
