Add scalar after a biasless convolution

Out of curiosity, what’s the fastest way of adding a scalar bias after a convolution?

  1. F.conv1d(x, weight) + eps
  2. F.conv1d(x, weight, torch.tensor(eps, dtype=x.dtype).expand(len(weight)))
  3. F.conv1d(x, weight, torch.full((len(weight),), eps, dtype=x.dtype))

Does convolution call .contiguous() on bias before adding?

I ran some quick tests with the following script:

import timeit

setup = """

import torch
import torch.nn.functional as F

device = "cuda"

x = torch.rand(200, 3, 200, device=device)
weight = torch.rand(256, 3, 3, device=device)
eps = 1e-5

"""

t1 = "F.conv1d(x, weight) + eps"
t2 = "F.conv1d(x, weight, torch.tensor(eps, device=device).expand(len(weight)))"
t3 = "F.conv1d(x, weight, torch.full((len(weight), ), eps, device=device))"

number = 100

print("%d ms" % round(1000 * timeit.timeit(stmt=t1, setup=setup, number=number)))
print("%d ms" % round(1000 * timeit.timeit(stmt=t2, setup=setup, number=number)))
print("%d ms" % round(1000 * timeit.timeit(stmt=t3, setup=setup, number=number)))

It turns out that the first variant is the fastest on GPU, but the slowest on CPU. Moreover, the second variant throws a cuDNN error.

As for the 2nd question, I guess it does call .contiguous(); here’s a little snippet to check that:

import torch
import torch.nn.functional as F

device = "cuda"

x = torch.rand(200, 3, 200, device=device)
weight = torch.rand(256, 3, 3, device=device)
sample = torch.rand(512, device=device)

contiguous_bias = sample[::2].contiguous()
not_contiguous_bias = sample[::2]

res1 = F.conv1d(x, weight, contiguous_bias)
res2 = F.conv1d(x, weight, not_contiguous_bias)

print(res1.equal(res2))  # True

EDIT: I’m not sure I understood the 2nd question correctly. The snippet above shows that even if you pass a non-contiguous bias to a convolution, things still work fine. But if you are interested in a biasless convolution, I think your question relates more to the summation kernel, which I believe makes a contiguous call before running.
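
If you want to check whether a contiguous copy actually happens on your side, one quick probe (just a sketch, nothing authoritative about conv internals) is to compare data pointers: a non-contiguous view shares storage with its base, while .contiguous() allocates a fresh buffer:

import torch

sample = torch.rand(512)
view = sample[::2]          # non-contiguous view, shares storage with sample
copy = view.contiguous()    # materializes a fresh, contiguous buffer

print(view.is_contiguous())                   # False
print(view.data_ptr() == sample.data_ptr())   # True -> same underlying storage
print(copy.data_ptr() == sample.data_ptr())   # False -> a copy was made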

@levivana Thanks a lot for the detailed test. The results are interesting!

I’m not worried that it will return a bad result if I pass a non-contiguous tensor; I was basically wondering whether it makes sense to cache a materialized scalar bias tensor if I’m calling this kind of op many times (if the underlying conv call converts the bias to contiguous, caching it would bypass the copy).
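
To illustrate what I mean by caching, here’s a rough sketch (the module and attribute names are just placeholders I made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BiaslessConvWithEps(nn.Module):
    # Sketch: conv1d without a learned bias, plus a scalar bias cached as a buffer
    def __init__(self, in_channels, out_channels, kernel_size, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(out_channels, in_channels, kernel_size))
        # materialize eps once and reuse it on every call instead of rebuilding it
        self.register_buffer("eps_bias", torch.full((out_channels,), eps))

    def forward(self, x):
        return F.conv1d(x, self.weight, self.eps_bias)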

Alright, I got it!

BTW, there was a mistake in the previous snippet -> I forgot to synchronize the CUDA kernels :man_facepalming:

Here is the corrected snippet (PS: I removed t2 since it throws an error):

import timeit

setup = """

import torch
import torch.nn.functional as F

device = "cuda"

x = torch.rand(200, 3, 200, device=device)
weight = torch.rand(256, 3, 3, device=device)
eps = 1e-5
eps_t = torch.full((len(weight), ), eps, device=device)

"""

t1 = "F.conv1d(x, weight) + eps;torch.cuda.synchronize()"
t3 = "F.conv1d(x, weight, torch.full((len(weight), ), eps, device=device));torch.cuda.synchronize()"
t4 = "F.conv1d(x, weight, eps_t);torch.cuda.synchronize()"

number = 100

print("%d ms" % round(1000 * timeit.timeit(stmt=t1, setup=setup, number=number)))
print("%d ms" % round(1000 * timeit.timeit(stmt=t3, setup=setup, number=number)))
print("%d ms" % round(1000 * timeit.timeit(stmt=t4, setup=setup, number=number)))

And of course, the results are different! Now t3 is always faster than t1 :laughing:. If you store your eps in a pre-built tensor, it gives a slight further improvement (about 5%) -> that’s what t4 measures.

Your test doesn’t check the expand version, but big thanks for the snippet anyway :slight_smile: Now it’s easy for me to check whether the expand time is the same as the materialized-tensor time.
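
For reference, I think an extra statement like this should slot the expand variant into the same benchmark (untested on my side, and the earlier t2 suggests a zero-stride expanded bias may hit the same cuDNN error on GPU):

# add `eps_scalar = torch.tensor(eps, device=device)` to the setup string above
t5 = "F.conv1d(x, weight, eps_scalar.expand(len(weight)));torch.cuda.synchronize()"
print("%d ms" % round(1000 * timeit.timeit(stmt=t5, setup=setup, number=number)))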
