Pdist vs. cdist performance

I have seen this question asked in the past, but I am getting a strange result: pdist is roughly 16 times slower than cdist (215 ms vs. 13.1 ms). I do not believe I saw this difference in previous versions of PyTorch (I am using 2.1.1).
Here are the steps to reproduce my timings:

import numpy as np
import torch
import scipy.spatial.distance # to convert pdist vector to matrix

x = torch.randn(3000,200,device='cuda',dtype=torch.float32)
x_unsqueeze = x.unsqueeze(0) # need batch dimension for cdist

pdist = torch.pdist(x).cpu().numpy()
pdist = scipy.spatial.distance.squareform(pdist)

cdist = torch.cdist(x_unsqueeze,x_unsqueeze).squeeze().cpu().numpy()
np.fill_diagonal(cdist,0) # set diagonal to 0 for comparison with pdist result

np.allclose(cdist,pdist) # verify that both computations give the same result

%timeit torch.cdist(x_unsqueeze,x_unsqueeze).squeeze().cpu().numpy()
13.1 ms ± 8.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit torch.pdist(x).cpu().numpy()
215 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
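
For anyone comparing the two outputs: as far as I understand it, pdist returns only the n*(n-1)/2 distances for pairs i < j (the upper triangle, row by row), which is why I need squareform before comparing against the full cdist matrix. Below is a minimal pure-Python sketch of that layout. The helper names pairwise_condensed and to_square are made up for illustration and are not part of torch or scipy; this only mirrors the ordering I believe both libraries use, not their actual implementation.

```python
import math

def pairwise_condensed(points):
    """Distances for all pairs i < j, upper triangle in row-major order."""
    n = len(points)
    return [
        math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def to_square(condensed, n):
    """Expand a condensed vector into a symmetric n x n matrix with a zero diagonal."""
    square = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Each condensed entry fills two mirrored cells.
            square[i][j] = square[j][i] = condensed[k]
            k += 1
    return square

# Hypothetical toy data: three 2-D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
condensed = pairwise_condensed(points)       # pairs (0,1), (0,2), (1,2)
square = to_square(condensed, len(points))   # full 3 x 3 matrix
```

So for 3000 points the pdist result has 3000*2999/2 = 4,498,500 entries, about half of the 9,000,000 entries cdist produces, which makes the slower timing even more surprising to me.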