Torch.norm 3-6x slower than manually calculating sum of squares?

I just happened to notice that computing the L2 distance between two tensors is consistently 3-6 times slower when using the builtin torch.norm versus manually computing the square root of the sum of squares.

x = torch.randn(1024, 256)
y = torch.randn(1024, 256)
%%timeit
torch.sqrt((x - y).pow(2).sum(1))

333 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
torch.norm(x - y, 2, 1)

2.01 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is this to be expected? Does anyone know why there is such a difference in performance here?
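For what it's worth, here is a standalone version of the comparison, a rough sketch using torch.utils.benchmark (assuming a PyTorch version recent enough to have it), which tends to give more stable numbers than %%timeit:

import torch
from torch.utils import benchmark

x = torch.randn(1024, 256)
y = torch.randn(1024, 256)

# Manual formulation: square root of the sum of squares along dim 1
t_manual = benchmark.Timer(
    stmt="torch.sqrt((x - y).pow(2).sum(1))",
    globals={"torch": torch, "x": x, "y": y},
)

# Built-in formulation: torch.norm with p=2 along dim 1
t_norm = benchmark.Timer(
    stmt="torch.norm(x - y, 2, 1)",
    globals={"torch": torch, "x": x, "y": y},
)

print(t_manual.timeit(1000))
print(t_norm.timeit(1000))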


That doesn’t sound good… I’ve opened an issue for it here: https://github.com/pytorch/pytorch/issues/5671

I ran a few tests on GPUs (Titan Xp, 1080 Ti, and Tesla P100) and did not observe this phenomenon; I got very similar timings on the GPU for both functions. I also tried it with larger tensors.
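One caveat with GPU timings in general: CUDA kernels launch asynchronously, so the comparison is only meaningful if the device is synchronized around the timed region. A minimal sketch of such a timing loop, assuming a CUDA device is available (the time_fn helper is just for illustration):

import time
import torch

x = torch.randn(1024, 256, device="cuda")
y = torch.randn(1024, 256, device="cuda")

def time_fn(fn, iters=1000):
    # Warm up, then wait for all queued kernels before starting the clock
    for _ in range(10):
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print(time_fn(lambda: torch.sqrt((x - y).pow(2).sum(1))))
print(time_fn(lambda: torch.norm(x - y, 2, 1)))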

@Latope2-150 do you notice this phenomenon on the CPU?

Yes, I see the same order-of-magnitude performance issue on the CPU. I tried it with my i7-7600U on a laptop and with an E5-2680 (but I was not the only one using the server, so it might not be a fair test). I noticed the difference is even larger when the tensors are smaller.

I still have this problem on the CPU (although not on the GPU). I have compiled the latest version of PyTorch from source (commit #542c273) with MKL-DNN. torch.norm (L2) seems to be about 3-5 times slower than the square root of a sum of squares.

Here is my code to reproduce this. On my system, this code takes 383.1 seconds with line #1, and 75.2 seconds with line #2.

import torch
import time

loss = 0

A = torch.rand(10, 100, 1000)
A.requires_grad = True

start = time.time()

for i in range(20000):
    B = torch.norm(A, dim=0)                    #1: built-in L2 norm over dim 0
    # B = torch.sqrt(torch.sum(A * A, dim=0))   #2: manual square root of sum of squares
    loss += torch.sum(B)

loss.backward()
end = time.time()

print(end - start)

May 29th 2019, running PyTorch 1.1.0 on Intel® Core™ i7-6920HQ CPU @ 2.90GHz
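In case it helps narrow down where the time goes, here is a rough sketch that times the forward pass separately from a combined forward + backward pass for each formulation (same tensor shape as above; the bench helper is just for illustration):

import time
import torch

A = torch.rand(10, 100, 1000, requires_grad=True)

def bench(fn, iters=1000):
    # Forward only
    start = time.time()
    for _ in range(iters):
        fn(A)
    forward = time.time() - start

    # Forward + backward (gradients accumulate into A.grad, which is fine for timing)
    start = time.time()
    for _ in range(iters):
        fn(A).sum().backward()
    forward_backward = time.time() - start
    return forward, forward_backward

print("torch.norm:            ", bench(lambda t: torch.norm(t, dim=0)))
print("sqrt of sum of squares:", bench(lambda t: torch.sqrt(torch.sum(t * t, dim=0))))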

The difference persists; I get a ~13x slowdown on a vector of length ~14,000:

%%timeit
distances = torch.norm(vertices - point_locs, p=2, dim=1)
3.63 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs

%%timeit
distances = torch.sqrt((vertices - point_locs).pow(2).sum(1))
271 µs ± 5.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
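For anyone trying to reproduce this: vertices and point_locs are 2-D float tensors with roughly 14,000 rows. A sketch with random stand-ins, assuming 3 columns per row (which may differ from the real data):

import torch

# Hypothetical stand-ins for the real tensors (assumed shape: ~14,000 x 3)
vertices = torch.randn(14000, 3)
point_locs = torch.randn(14000, 3)

distances_builtin = torch.norm(vertices - point_locs, p=2, dim=1)
distances_manual = torch.sqrt((vertices - point_locs).pow(2).sum(1))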

I can also report it. It is easily reproduced on Colab; the problem still remains.

x = torch.randn(1024, 256)
y = torch.randn(1024, 256)
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
x = x.cuda()
y = y.cuda()
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)

Results:

CPU:
1000 loops, best of 3: 465 µs per loop   (sqrt of sum of squares)
100 loops, best of 3: 4.23 ms per loop   (torch.norm)

GPU:
The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44 µs per loop   (sqrt of sum of squares)
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 24 µs per loop   (torch.norm)
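Both expressions compute the same values, so the gap is purely in how they are implemented. A quick sanity check (sketch):

import torch

x = torch.randn(1024, 256)
y = torch.randn(1024, 256)

manual = torch.sqrt((x - y).pow(2).sum(1))
builtin = torch.norm(x - y, 2, 1)

# The two results should agree up to floating-point rounding
print(torch.allclose(manual, builtin, atol=1e-6))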