Your GPU profiling is invalid because you are not synchronizing the code. CUDA operations are executed asynchronously, so at best you are timing the dispatching and kernel launches, and in the worst case you are timing random syncs.
Besides that, you might also want to profile different shapes and add warmup iterations.
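
A minimal sketch of what a synchronized measurement with warmup could look like. `time_op` is a hypothetical helper name, and the shapes and iteration counts are arbitrary assumptions rather than recommendations; the script falls back to the CPU when no GPU is available and skips the comparison when PyTorch is not installed.

```python
import time


def time_op(fn, sync=None, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of fn().

    sync is called before the clock is started and before it is stopped,
    so that asynchronously queued work is included in the measurement.
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):  # warmup runs are executed but not timed
        fn()
    sync()  # wait for pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all queued work before stopping the clock
    return (time.perf_counter() - start) / iters


try:
    import torch
except ImportError:  # keep the helper usable without PyTorch
    torch = None

if torch is not None:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # torch.cuda.synchronize blocks until all queued CUDA kernels have finished
    sync = torch.cuda.synchronize if device == "cuda" else None
    for n in (128, 512):  # assumed shapes, extend as needed
        x = torch.randn(n, n, device=device)
        t_pdist = time_op(lambda: torch.pdist(x), sync)
        t_cdist = time_op(lambda: torch.cdist(x, x), sync)
        print(f"{n}x{n} on {device}: "
              f"pdist {t_pdist * 1e3:.3f} ms, cdist {t_cdist * 1e3:.3f} ms")
```

Without the two `sync()` calls, the GPU numbers would mostly reflect how long it takes to enqueue the kernels, not how long they take to run.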

Hi,
Thanks a lot for your answer. I’m not very familiar with profiling and, honestly, I did not fully understand everything you said, but it still seems that:

On CPU at least, my profiling looks fair: pdist is twice as slow as cdist, even though pdist technically performs half as many operations as cdist, and no CUDA parallelization black magic should be involved.

On GPU, my measurements might not be very precise, but the difference in speed is still an order of magnitude, so it’s definitely enough to observe that pdist is slower than cdist.
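
To make the "half as many operations" argument concrete, here is a small sketch that only counts the distances each function has to produce for n input rows. `pair_counts` is a hypothetical helper written for illustration, and it says nothing about how either function is actually implemented.

```python
def pair_counts(n):
    """Number of distances produced for n input rows."""
    pdist_pairs = n * (n - 1) // 2  # unordered pairs i < j only
    cdist_pairs = n * n             # full n x n matrix for cdist(x, x)
    return pdist_pairs, cdist_pairs


for n in (128, 512, 2048):
    p, c = pair_counts(n)
    print(f"n={n}: pdist {p} distances, cdist {c} distances, ratio {p / c:.3f}")
```

For n = 512 this gives 130816 distances for pdist against 262144 for cdist, so slightly less than half.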

I reported the results for only one shape for simplicity and readability, but I observed similar behavior with different shapes. I chose (512*512) because it’s a reasonable size for real-world usage. Moreover, pdist should do better than cdist, particularly as the size of the tensor increases.

Overall, I’m not interested in precisely benchmarking pdist and cdist; I just wanted to point out that the implementation of torch.pdist looks extremely suboptimal, hence my minimal example. Do you know how pdist is implemented and why the gap between pdist and cdist is so large?