- a Convxd with kernel size 1?
- a Linear layer with appropriate views?
I’ve created a small comparison using nn.Conv2d and nn.Linear for a random input of [10, 64, 128, 128].
As I’ve used unfold and permute for the linear implementation, it’s slower on my machine compared to the conv approach. Maybe you have another idea for how to avoid the (expensive) shape translations.
Here is the gist.
Hi Patrick, thanks. In fact you can implement a pointwise convolution directly in Linear, since a pointwise convolution is in fact just a Linear operation:
import time
import torch
from torch import nn
N = 128
seq_len = 5
embedding_size = 32
input = torch.rand(N, embedding_size, seq_len)
c1 = nn.Conv1d(embedding_size, embedding_size, kernel_size=1, padding=0, bias=False)
start_time = time.time()
for it in range(500):
    output = c1(input)
print('c1 time', time.time() - start_time)
print('output[0,:,0]', output[0,:,0])
# use linear...
c2 = nn.Linear(embedding_size, embedding_size, bias=False)
c2.weight.data[:] = c1.weight.data.view(embedding_size, embedding_size)
start_time = time.time()
for it in range(500):
    # Linear acts on the last dim, so move channels last and back again
    output = c2(input.transpose(-2, -1)).transpose(-1, -2)
print('c2 time', time.time() - start_time)
print('output[0,:,0]', output[0,:,0])
(Edit: note that I wrote this code after posting the question)
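To make the equivalence explicit rather than eyeballing the printed slices, here is a minimal sketch that copies the conv weights into a Linear and compares the full outputs. The smaller batch size and the tolerance are my own choices, not something from the code above.

```python
import torch
from torch import nn

torch.manual_seed(0)
N, embedding_size, seq_len = 4, 32, 5  # smaller batch than above, just for a quick check
x = torch.rand(N, embedding_size, seq_len)

conv = nn.Conv1d(embedding_size, embedding_size, kernel_size=1, bias=False)
linear = nn.Linear(embedding_size, embedding_size, bias=False)

# Conv1d weight has shape (out_channels, in_channels, 1); drop the kernel dim
# so it matches the (out_features, in_features) weight of Linear.
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))

out_conv = conv(x)
# Linear acts on the last dim, so move channels last and back again.
out_linear = linear(x.transpose(-2, -1)).transpose(-2, -1)

# atol is an assumed tolerance for float32 rounding differences.
same = torch.allclose(out_conv, out_linear, atol=1e-5)
print('outputs match:', same)
```

If the two ever disagree beyond rounding error, the weight copy or the transposes are the first things I would check.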
Yeah, you are right. Forget my gist. It’s way too complicated.
Somehow I got stuck thinking about how the “window” moves along the input volume and made the code too complicated.
What times do you get? Is the conv still faster on your machine? I’ve updated my code with your approach and while the linear approach for the 1D case is faster than conv, it seems to be the contrary for the 2D case.
I only tried the 1d case, and only on CPU, where Linear was 10 times faster. Interesting that conv2d is in fact faster than linear in the 2d case. Since they are mathematically equivalent, I wonder if this means we should use pointwise conv2ds instead of Linears for certain geometries?
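For reference, the same trick in the 2d case could look roughly like this. This is only a sketch under my own assumptions (small toy sizes, channels moved last with permute); it is not the code from the gist.

```python
import torch
from torch import nn

torch.manual_seed(0)
N, C, H, W = 2, 16, 8, 8  # toy sizes, chosen only for illustration
x = torch.rand(N, C, H, W)

conv = nn.Conv2d(C, C, kernel_size=1, bias=False)
linear = nn.Linear(C, C, bias=False)

# Conv2d weight is (C_out, C_in, 1, 1); reshape it to (C_out, C_in) for Linear.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(C, C))

out_conv = conv(x)
# Move channels last (N, H, W, C), apply Linear over channels, move them back.
out_linear = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

match_2d = torch.allclose(out_conv, out_linear, atol=1e-5)
print('2d outputs match:', match_2d)
```

So the math is the same in both cases; any speed difference has to come from the layout handling and the kernels that get used.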
I’m not convinced yet. Let me try it on a GPU on a Server.
(actually, as a very off-topic aside, when I wrote DeepCL, I was too lazy to write a Linear implementation, and just implemented it as a convolution where the kernel width and height exactly equaled the width and height of the incoming image https://github.com/hughperkins/DeepCL/blob/master/src/fc/FullyConnectedLayer.cpp#L29 )
It’s an interesting approach
In fact, the convolutions seem to be faster. Here are the timings for a GTX1070 and an old i7 CPU:
GPU, 2d, 500 repetitions:
Conv2d: 0.14418 sec
Linear: 1.52989 sec
CPU, 2d, 500 repetitions:
Conv2d: 6.34300 sec
Linear: 11.29293 sec
--------------------------
GPU, 1d, 500 repetitions:
Conv1d: 0.147367 sec
Linear: 1.52954 sec
CPU, 1d, 500 repetitions:
Conv1d: 6.43247 sec
Linear: 12.29751 sec
Note that the difference between the 1D and 2D case should be minimal, as I’ve used h, w = 128, 128 for 2D and h = 128*128 for 1D.
Maybe the workloads are unrealistic?
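One thing that might be worth ruling out on the GPU side: CUDA kernels are launched asynchronously, so a plain time.time() loop can stop the clock before the work has finished. Below is a rough timing helper that warms up first and synchronizes before reading the clock. The helper name time_forward, the repetition counts and the toy sizes are all made up for this sketch, and I’m not claiming the numbers above were measured without synchronization.

```python
import time
import torch
from torch import nn

def time_forward(fn, x, reps=50, warmup=5):
    """Return average wall-clock seconds per call of fn(x) (hypothetical helper)."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        for _ in range(warmup):       # warm-up runs are not timed
            fn(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for pending GPU work before starting
        start = time.perf_counter()
        for _ in range(reps):
            fn(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait until all timed kernels have finished
    return (time.perf_counter() - start) / reps

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.rand(2, 16, 32, 32, device=device)  # toy workload, not the one timed above
conv = nn.Conv2d(16, 16, kernel_size=1, bias=False).to(device)
linear = nn.Linear(16, 16, bias=False).to(device)

t_conv = time_forward(conv, x)
t_linear = time_forward(
    lambda t: linear(t.permute(0, 2, 3, 1)).permute(0, 3, 1, 2), x)
print('conv2d: %.6f s, linear: %.6f s' % (t_conv, t_linear))
```

Swapping in the [10, 64, 128, 128] input from earlier in the thread should give numbers comparable to the table above.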
Interesting. That’s surprising. And quite a big difference, in favor of using conv layers.
I guess it’s due to the transpose calls, which are not necessary for the conv layers.
Ah, that sounds plausible.
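A small caveat on that guess, as far as I understand it: transpose itself only returns a view with swapped strides and does not copy anything. The cost, if any, would come later, when an op needs the data in contiguous memory. The sketch below only shows the view/copy behaviour; whether Linear actually makes such a copy internally here is an assumption I have not verified.

```python
import torch

x = torch.rand(4, 32, 5)
xt = x.transpose(-2, -1)  # shape (4, 5, 32), same storage as x

# The transposed tensor starts at the same memory address: no copy was made.
shares_memory = xt.data_ptr() == x.data_ptr()
# ...but its strides no longer describe a contiguous layout.
contiguous_before = xt.is_contiguous()

# .contiguous() materializes a real copy in the new layout.
xc = xt.contiguous()
contiguous_after = xc.is_contiguous()
copied = xc.data_ptr() != x.data_ptr()

print(shares_memory, contiguous_before, contiguous_after, copied)
```

If the data can be kept channels-last from the start, the Linear version needs no transposes at all, which would be the fairer comparison.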
Hi, it seems there is no difference between conv1d and linear except for efficiency, so how do I choose which one to use for pointwise convolutions?