Preferred/most efficient way of implementing pointwise convolutions?

hughperkins · July 29, 2018, 10:16pm

a Convxd with kernel size 1?
a Linear layer with appropriate views?

ptrblck · July 29, 2018, 10:45pm

I’ve created a small comparison using nn.Conv2d and nn.Linear for a random input of shape [10, 64, 128, 128].
As I’ve used unfold and permute for the linear implementation, it’s slower on my machine compared to the conv approach. Maybe you have another idea how to avoid the (expensive) shape translations.
Here is the gist.

hughperkins · July 29, 2018, 10:51pm

Hi Patrick, thanks. You can implement a pointwise convolution directly with Linear, since a pointwise convolution is just a Linear operation applied independently at every position:

import time

import torch
from torch import nn

N = 128              # batch size
seq_len = 5
embedding_size = 32

# Conv1d expects input of shape [N, channels, seq_len]
x = torch.rand(N, embedding_size, seq_len)

# pointwise convolution: kernel_size=1, so each position is transformed on its own
c1 = nn.Conv1d(embedding_size, embedding_size, kernel_size=1, padding=0, bias=False)
start_time = time.time()
for it in range(500):
    output = c1(x)
print('c1 time', time.time() - start_time)
print('output[0,:,0]', output[0,:,0])

# use linear...

c2 = nn.Linear(embedding_size, embedding_size, bias=False)
# copy the conv weights ([out, in, 1] -> [out, in]) so both layers compute the same function
c2.weight.data[:] = c1.weight.data.view(embedding_size, embedding_size)
start_time = time.time()
for it in range(500):
    # Linear acts on the last dimension, so move channels last and then back again
    output = c2(x.transpose(-2, -1)).transpose(-1, -2)
print('c2 time', time.time() - start_time)
print('output[0,:,0]', output[0,:,0])

(Edit: note that I wrote this code after posting the question)

ptrblck · July 29, 2018, 11:20pm

Yeah, you are right. Forget my gist. It’s way too complicated.
Somehow I got stuck thinking about how the “window” moves along the input volume and overengineered the code.

What times do you get? Is the conv still faster on your machine? I’ve updated my code with your approach, and while the linear approach is faster than conv for the 1D case, the opposite seems to be true for the 2D case.
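
The updated 2D code is not shown in the thread, so the following is only a minimal sketch of what the 2D analogue of the snippet above might look like. The shapes are small, made-up values, and the permute pattern is an assumption rather than a copy of the gist. It checks that a 1x1 Conv2d and a Linear sharing the same weights give the same output:

```python
import torch
from torch import nn

torch.manual_seed(0)
# small hypothetical shapes, chosen only to keep the check quick
batch, channels, h, w = 2, 8, 4, 4

x = torch.rand(batch, channels, h, w)
conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
linear = nn.Linear(channels, channels, bias=False)

with torch.no_grad():
    # conv weight is [out, in, 1, 1]; Linear expects [out, in]
    linear.weight.copy_(conv.weight.view(channels, channels))

    out_conv = conv(x)
    # Linear acts on the last dimension: [N, C, H, W] -> [N, H, W, C] -> back
    out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

max_diff = (out_conv - out_linear).abs().max().item()
print('max abs difference:', max_diff)
```

Any difference should be at the level of floating-point rounding, since both layers compute the same channel-mixing matrix product at every pixel.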

hughperkins · July 29, 2018, 11:23pm

I only tried the 1D case, and only on CPU, where Linear was 10 times faster. Interesting that Conv2d is faster than Linear in the 2D case. Since they are mathematically equivalent, I wonder if this means we should use pointwise Conv2d layers instead of Linear layers for certain geometries?

ptrblck · July 29, 2018, 11:27pm

I’m not convinced yet. Let me try it on a GPU on a server.

hughperkins · July 29, 2018, 11:27pm

(Actually, as a very off-topic aside: when I wrote DeepCL, I was too lazy to write a Linear implementation, and just implemented it as a convolution where the kernel width and height exactly equaled the width and height of the incoming image: https://github.com/hughperkins/DeepCL/blob/master/src/fc/FullyConnectedLayer.cpp#L29 )
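
The same idea can be illustrated in PyTorch. This is a minimal sketch and not a port of the DeepCL code; the shapes and variable names are made up for the example. A fully connected layer over a flattened image matches a convolution whose kernel covers the whole image, once the weights are reshaped accordingly:

```python
import torch
from torch import nn

torch.manual_seed(0)
# hypothetical sizes for illustration only
batch, channels, height, width, out_features = 3, 2, 5, 5, 7

x = torch.rand(batch, channels, height, width)
fc = nn.Linear(channels * height * width, out_features, bias=False)
# kernel spans the full image, so the conv produces a single 1x1 output position
conv = nn.Conv2d(channels, out_features, kernel_size=(height, width), bias=False)

with torch.no_grad():
    # [out, C*H*W] -> [out, C, H, W]; matches the (C, H, W) order used by flatten(1)
    conv.weight.copy_(fc.weight.view(out_features, channels, height, width))

    out_fc = fc(x.flatten(1))        # [batch, out_features]
    out_conv = conv(x).flatten(1)    # [batch, out_features, 1, 1] -> [batch, out_features]

print('max abs difference:', (out_fc - out_conv).abs().max().item())
```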

ptrblck · July 29, 2018, 11:44pm

It’s an interesting approach.

In fact, the convolutions seem to be faster. Here are the timings for a GTX1070 and an old i7 CPU:

GPU, 2d, 500 repetitions:
Conv2d: 0.14418 sec
Linear: 1.52989 sec

CPU, 2d, 500 repetitions:
Conv2d: 6.34300 sec
Linear: 11.29293 sec

--------------------------

GPU, 1d, 500 repetitions:
Conv1d: 0.147367 sec
Linear: 1.52954 sec

CPU, 1d, 500 repetitions:
Conv1d: 6.43247 sec
Linear: 12.29751 sec

Note that the difference between the 1D and 2D case should be minimal, as I’ve used h, w = 128, 128 for 2D and h = 128*128 for 1D.
Maybe the workloads are unrealistic?
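
Why the 1D and 2D workloads are comparable can be checked directly: for a kernel size of 1, flattening the spatial dimensions does not change the result. The following is a minimal sketch with small, made-up shapes (the thread's benchmark used 128 x 128), not the benchmark code itself:

```python
import torch
from torch import nn

torch.manual_seed(0)
# small hypothetical shapes; the thread used h, w = 128, 128
batch, channels, h, w = 2, 8, 6, 6

x2d = torch.rand(batch, channels, h, w)
conv2d = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
conv1d = nn.Conv1d(channels, channels, kernel_size=1, bias=False)

with torch.no_grad():
    # [out, in, 1, 1] -> [out, in, 1]
    conv1d.weight.copy_(conv2d.weight.view(channels, channels, 1))

    out2d = conv2d(x2d)
    # treat the h*w pixels as one long sequence, then restore the 2D layout
    out1d = conv1d(x2d.view(batch, channels, h * w)).view(batch, channels, h, w)

print('max abs difference:', (out2d - out1d).abs().max().item())
```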

hughperkins · July 29, 2018, 11:48pm

Interesting. That’s surprising. And quite a big difference, in favor of using conv layers.

ptrblck · July 29, 2018, 11:51pm

I guess it’s due to the transpose calls, which are not necessary for the conv layers.
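
One way to test this guess is to time Linear both with the transposes inside the loop and on input that is already stored channels-last. The following is a rough sketch, not the code used for the timings above: `bench` is a hypothetical helper, the shapes are smaller than the thread's workload, and results will vary by hardware and PyTorch version. Because CUDA kernels run asynchronously, the helper synchronizes before reading the clock when a GPU is in use.

```python
import time

import torch
from torch import nn


def bench(fn, reps=50):
    """Hypothetical helper: average wall-clock seconds per call of fn()."""
    fn()  # warm-up call, excluded from the measurement
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / reps


device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch, channels, length = 4, 64, 1024  # assumed sizes, smaller than in the thread

x = torch.rand(batch, channels, length, device=device)
# channels-last copy made once, outside the timed loop
x_last = x.transpose(1, 2).contiguous()

conv = nn.Conv1d(channels, channels, kernel_size=1, bias=False).to(device)
linear = nn.Linear(channels, channels, bias=False).to(device)

with torch.no_grad():
    timings = {
        'conv1d': bench(lambda: conv(x)),
        'linear + transposes': bench(lambda: linear(x.transpose(1, 2)).transpose(1, 2)),
        'linear, pre-transposed': bench(lambda: linear(x_last)),
    }

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```

If the two Linear variants come out close to each other, the transposes are unlikely to be the main cost, and the gap would have to come from somewhere else.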

hughperkins · July 29, 2018, 11:52pm

Ah, that sounds plausible.

Xuyang_Bai · April 15, 2019, 1:28pm

Hi, it seems there is no difference between Conv1d and Linear apart from efficiency, so how should I choose which one to use for pointwise convolutions?