I need to speed up the following computation using a single GPU.

```
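import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch  # assumed alias; sch.distance is scipy.spatial.distance
import torch


def load_points_cacl_distance(self, path):
    # Read the first two tab-separated columns as point coordinates.
    points = pd.read_csv(path, sep='\t', usecols=[0, 1])
    max_id = len(points)
    device = torch.device("cuda")
    # Empty max_id x max_id matrix that the loop below fills in.
    d = pd.DataFrame(np.zeros((max_id, max_id)))
    tensor1 = torch.from_numpy(d.values)
    # Condensed pairwise Euclidean distances (computed on the CPU by SciPy).
    dis = sch.distance.pdist(points, 'euclidean')
    tensor2 = torch.from_numpy(dis)
    print(dis)
    n = 0
    for i in range(max_id):
        print(i)
        for j in range(i + 1, max_id):
            d.at[i, j] = tensor2[n].to(device)
            d.at[j, i] = tensor1[i, j].to(device)
            n += 1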
```

However, the code still runs slowly, even with the tensors moved to the GPU.
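
For reference, here is a minimal sketch of the direction I am considering instead of the loop. My understanding is that the loop above copies one scalar at a time to the GPU and writes it into a CPU `DataFrame`, so little or no arithmetic actually happens on the device. The sketch assumes the whole distance matrix fits in GPU memory and uses `torch.cdist` to compute it in one call. The function name `pairwise_distances` is hypothetical, and it falls back to the CPU when CUDA is not available.

```python
import torch


def pairwise_distances(points, device=None):
    """Return the full (n, n) Euclidean distance matrix as a torch tensor."""
    if device is None:
        # Fall back to the CPU when no CUDA device is visible.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # One host-to-device copy for the whole point set.
    x = torch.as_tensor(points, dtype=torch.float64).to(device)
    # torch.cdist computes all pairwise p-norm distances in a single call.
    return torch.cdist(x, x, p=2)


dist = pairwise_distances([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(dist.cpu())
```

With the `DataFrame` from my code, the call would presumably be `pairwise_distances(points.to_numpy())`. The result stays on the device, so it would be converted once at the end with `.cpu().numpy()` rather than element by element.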