Different outputs for `torch.pdist` between GPU and CPU

The following snippet reproduces the issue:

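import torch

a = torch.randn(100, 20)
b = torch.pdist(a)                # pairwise distances computed on the CPU
c = torch.pdist(a.cuda()).cpu()   # same computation on the GPU, copied back to the CPU
print(torch.sum(torch.abs(b - c)))  # tensor(0.0007)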

The difference between the GPU and CPU results seems quite large: the absolute differences sum to about 0.0007. What causes it?
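
To put that number in context, here is a rough sketch of a per-element check. It is only a sketch: the fixed seed, the float64 reference, and the `allclose` tolerances are my own assumptions rather than anything prescribed by PyTorch. The sum above runs over 100 * 99 / 2 = 4950 distances, so the largest per-element error, measured against a double-precision reference, may say more than the total. The GPU part only runs if CUDA is available.

```python
import torch

torch.manual_seed(0)  # fixed seed for repeatability (an assumption, not in the snippet above)

a = torch.randn(100, 20)
b = torch.pdist(a)             # float32 result on the CPU
ref = torch.pdist(a.double())  # float64 reference on the CPU

n_pairs = b.numel()                   # 100 * 99 / 2 = 4950 pairwise distances
eps = torch.finfo(torch.float32).eps  # float32 machine epsilon, about 1.19e-07

# Error of the float32 CPU result relative to the float64 reference.
cpu_err = (b.double() - ref).abs()
print("pairs:", n_pairs, "float32 eps:", eps)
print("CPU  max err:", cpu_err.max().item(), "sum err:", cpu_err.sum().item())

if torch.cuda.is_available():
    c = torch.pdist(a.cuda()).cpu()   # float32 result on the GPU
    gpu_err = (c.double() - ref).abs()
    print("GPU  max err:", gpu_err.max().item(), "sum err:", gpu_err.sum().item())
    print("max |b - c|:", (b - c).abs().max().item())
    # Tolerances are illustrative; pick ones that suit your use case.
    print("allclose:", torch.allclose(b, c, rtol=1e-5, atol=1e-6))
```

If both devices sit within a few multiples of `eps` times the distance magnitude of the float64 reference, the gap is probably ordinary float32 rounding, for example from a different order of accumulation, rather than a bug on either side.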