Why is `torch.pdist` so slow?

Although torch.pdist is a native C++ function in PyTorch, it seems that using it is slower than calling torch.cdist and masking half of the resulting matrix.

Here is the code snippet:

import torch

print(torch.__version__)
z1 = torch.randn(512, 512)

for device in ["cpu", "cuda"]:
    print(device)
    z1 = z1.to(torch.device(device))

    # Strict upper-triangular mask: selects the same entries that pdist returns.
    mask = torch.triu(torch.ones(z1.size(0), z1.size(0), dtype=bool, device=z1.device), diagonal=1)
    torch.testing.assert_close(torch.cdist(z1, z1)[mask], torch.pdist(z1))

    %timeit torch.pdist(z1)
    # cdist + mask, rebuilding the mask on every call
    %timeit torch.cdist(z1, z1)[torch.triu(torch.ones(z1.size(0), z1.size(0), dtype=bool, device=z1.device), diagonal=1)]
    # cdist + precomputed mask
    %timeit torch.cdist(z1, z1)[mask]

and here is the output when running in a Colab notebook:

2.0.0+cu118
cpu
12.5 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.12 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.86 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cuda
3.68 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
320 µs ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
245 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Then what is the point of torch.pdist?


Your profiling is invalid for the GPU, as you are not synchronizing the code: at best you are profiling the dispatching and kernel launches, and at worst random synchronization points.
Besides that, you might also want to profile different shapes and include warmup iterations.
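
If it helps, here is a minimal sketch of one way to time both calls with proper CUDA synchronization and warmup, using torch.utils.benchmark (the shapes mirror your example; this is just an illustration, not the only valid approach):

    import torch
    import torch.utils.benchmark as benchmark

    z1 = torch.randn(512, 512, device="cuda")
    mask = torch.triu(torch.ones(512, 512, dtype=torch.bool, device="cuda"), diagonal=1)

    # benchmark.Timer synchronizes CUDA and runs warmup iterations before measuring,
    # so the numbers reflect kernel execution time rather than launch overhead.
    t_pdist = benchmark.Timer(
        stmt="torch.pdist(z1)",
        globals={"torch": torch, "z1": z1},
    )
    t_cdist = benchmark.Timer(
        stmt="torch.cdist(z1, z1)[mask]",
        globals={"torch": torch, "z1": z1, "mask": mask},
    )

    print(t_pdist.blocked_autorange())
    print(t_cdist.blocked_autorange())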

Hi,
Thanks a lot for your answer. I'm not very familiar with profiling and honestly I did not fully understand everything you said, but it still seems that:

  • on CPU at least, my profiling looks fair, and pdist is about twice as slow as cdist, even though pdist technically performs half as many operations (as the shape check after this list illustrates) and no CUDA parallelization black magic should be involved
  • on GPU my measurements might not be very precise, but the speed difference is an order of magnitude, which is definitely enough to observe that pdist is slower than cdist
  • I reported the results for only one shape for simplicity and readability, but I observed similar behavior with other shapes. I chose (512, 512) because it is a reasonable size for real-world usage; moreover, pdist should beat cdist especially as the tensor size increases
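
For reference, a quick shape check shows the "half as many values" point: pdist only materializes the strict upper triangle, while cdist computes the full matrix.

    import torch

    n = 512
    z1 = torch.randn(n, n)

    # pdist returns the n*(n-1)/2 pairwise distances of the strict upper triangle,
    # cdist materializes the full n x n matrix.
    print(torch.pdist(z1).shape)      # torch.Size([130816])  = 512 * 511 / 2
    print(torch.cdist(z1, z1).shape)  # torch.Size([512, 512]) = 262144 entries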

Overall, I'm not interested in precisely benchmarking pdist and cdist; I just wanted to point out that the implementation of torch.pdist looks extremely suboptimal, hence my minimal example. Do you know how pdist is implemented, and why the gap between pdist and cdist is so large?

I understand that you are not interested in precisely profiling workloads, but as already pointed out, your profiling is invalid.

You can see the implementation of pdist here and cdist here.