I just happened to notice that computing the L2 distance between two tensors is consistently 3-6 times slower with the built-in torch.norm than with manually computing the square root of the sum of squares.
x = torch.randn(1024, 256)
y = torch.randn(1024, 256)
%%timeit
torch.sqrt((x - y).pow(2).sum(1))
333 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
torch.norm(x - y, 2, 1)
2.01 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is this to be expected? Does anyone know why there is such a difference in performance here?
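In case it helps with reproducing, here is a rough standalone version of the same comparison using time.perf_counter instead of %%timeit (the absolute numbers will of course differ from the notebook timings above):

import time
import torch

x = torch.randn(1024, 256)
y = torch.randn(1024, 256)

def bench(fn, n_iters=1000):
    fn()  # warm-up call
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    return (time.perf_counter() - start) / n_iters

t_manual = bench(lambda: torch.sqrt((x - y).pow(2).sum(1)))
t_norm = bench(lambda: torch.norm(x - y, 2, 1))
print(f"manual sqrt of sum of squares: {t_manual * 1e6:.1f} us per loop")
print(f"torch.norm:                    {t_norm * 1e6:.1f} us per loop")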
I ran a few tests on GPUs (Titan Xp, 1080 Ti, and Tesla P100) and did not observe this phenomenon; I got very similar timings on GPU for both functions. I also tried it with larger tensors.
Yes, I see the same order-of-magnitude performance issue on CPU. I tried it with my i7-7600U on a laptop and with an E5-2680 (but I was not the only one using the server, so it might not be a fair test). I also noticed that the difference is even larger when the tensors are smaller.
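Something like the following can be used to check the size dependence (informal timing only; exact ratios will vary by machine):

import timeit
import torch

# Sweep the number of rows and compare torch.norm against the manual version.
for rows in (64, 256, 1024, 4096):
    x = torch.randn(rows, 256)
    y = torch.randn(rows, 256)
    t_manual = timeit.timeit(lambda: torch.sqrt((x - y).pow(2).sum(1)), number=1000)
    t_norm = timeit.timeit(lambda: torch.norm(x - y, 2, 1), number=1000)
    print(f"rows={rows:5d}  torch.norm / manual ratio: {t_norm / t_manual:.1f}x")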
I still have this problem on the CPU (although not on the GPU). I have compiled the latest version of PyTorch from source (commit #542c273) with MKL-DNN. torch.norm (L2) seems to be about 3-5 times slower than manually computing the square root of a sum of squares.
Here is my code to reproduce this. On my system, it takes 383.1 seconds with line #1 and 75.2 seconds with line #2.
import torch
import time

loss = 0
A = torch.rand(10, 100, 1000)
A.requires_grad = True

start = time.time()
for i in range(0, 20000):
    B = torch.norm(A, dim=0)                    #1: built-in L2 norm over dim 0
    # B = torch.sqrt(torch.sum(A * A, dim=0))   #2: manual sqrt of sum of squares
    loss += torch.sum(B)                        # accumulate a scalar loss
loss.backward()                                 # backprop once through the accumulated graph
end = time.time()
print(end - start)
I can also report it. It is easy to reproduce on Colab; the problem still remains.
x = torch.randn(1024, 256)
y = torch.randn(1024, 256)
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
x = x.cuda()
y = y.cuda()
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
Results:

CPU:
1000 loops, best of 3: 465 µs per loop   # torch.sqrt((x - y).pow(2).sum(1))
100 loops, best of 3: 4.23 ms per loop   # torch.norm(x - y, 2, 1)

GPU:
The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44 µs per loop   # torch.sqrt((x - y).pow(2).sum(1))
The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 24 µs per loop   # torch.norm(x - y, 2, 1)

So on CPU torch.norm is roughly 9x slower here, while on the GPU it is actually slightly faster than the manual version.