Cdist does not return zero for distance between same vectors

yamsk · May 15, 2022, 10:22am

Hi:

Sample code

import torch

a = torch.rand(64, 784)  # simulating a batch of flattened 28x28 images
print(a.shape, a)
dist = torch.cdist(a, a, p=2)  # pairwise Euclidean distances, shape (64, 64)
print('Diagonal distances')
for i in range(10):
    print(dist[i, i])  # distance of row i to itself

Result

torch.Size([64, 784]) tensor([[0.7266, 0.4859, 0.2753,  ..., 0.2172, 0.7718, 0.1553],
        [0.3704, 0.5248, 0.3265,  ..., 0.5382, 0.1589, 0.8711],
        [0.4320, 0.1686, 0.0767,  ..., 0.0733, 0.2244, 0.4947],
        ...,
        [0.8390, 0.0061, 0.2814,  ..., 0.4127, 0.4423, 0.3151],
        [0.3753, 0.3822, 0.8913,  ..., 0.8308, 0.0026, 0.7139],
        [0.1975, 0.2592, 0.4194,  ..., 0.5257, 0.4047, 0.2934]])
Diagonal distances
tensor(0.)
tensor(0.0055)
tensor(0.0055)
tensor(0.0068)
tensor(0.)
tensor(0.0078)
tensor(0.)
tensor(0.)
tensor(0.0055)
tensor(0.)

I would expect the diagonal distances to be exactly 0.0, since each one is the distance between a vector and itself. Instead I get values such as 0.0055, which are close to zero but not quite. Why is this the case?

Thanks!

InnovArul · May 15, 2022, 1:29pm

You might be hitting the issue below.

github.com/pytorch/pytorch

torch.cdist returns high diagonal values with CUDA

Opened 10:26PM - 05 May 21 UTC by mbahri

Labels: module: numerical-stability, triaged, module: tf32

## 🐛 Bug

In some cases, torch.cdist returns non-zero (i.e. far from machine epsilon) diagonal values with CUDA. Behaviour is as expected on CPU. The issue seems more severe on Ampere GPUs.

## To Reproduce

Steps to reproduce the behavior:

1. Download the following test data: https://github.com/mbahri/pytorch_bugreport/blob/6ed0186451723fd17dfb40c7086cbe31ca4d03b6/cdist/example_data.pth

```
X = torch.load('example_data.pth')  # Loads directly to cuda:0
X_cpu = X.cpu()

D = torch.cdist(X, X)
D_cpu = torch.cdist(X_cpu, X_cpu)

D[0][0,0]
# On a 2080 Ti / Titan RTX
# tensor(0.0055, device='cuda:0')
# On an RTX 3090
# tensor(0.2271, device='cuda:0')

D_cpu[0][0,0]
# tensor(0.)
```

## Expected behavior

The diagonal elements should be as close to 0 as possible. The values observed here (~0.23 on the RTX 3090 and ~5.5e-3 on the 2080 Ti) are too high.

## Environment

Machine with the 3090:

Note: the CUDNN version is 8.0.5 in /usr/local/cuda-11.1_cudnn_8.0.5

```
Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.10 (x86_64)
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
Clang version: Could not collect
CMake version: version 3.16.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: GPU 0: GeForce RTX 3090
Nvidia driver version: 460.73.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.2.0 h06a4308_296
[conda] mkl-service 2.3.0 py38h27cfd23_1
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.1 py38ha9443f7_2
[conda] numpy 1.20.1 py38h93e21f0_0
[conda] numpy-base 1.20.1 py38h7d8b39e_0
[conda] pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] torchaudio 0.8.1 py38 pytorch
[conda] torchvision 0.9.1 py38_cu111 pytorch
```

Machine with the 2080 Ti/Titan RTX

```
Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.20.1
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: TITAN RTX
GPU 2: GeForce RTX 2080 Ti
Nvidia driver version: 450.119.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch-lightning==1.2.7
[pip3] pytorch3d==0.4.0
[pip3] torch==1.8.1
[pip3] torch-cluster==1.5.9
[pip3] torch-geometric==1.7.0
[pip3] torch-scatter==2.0.6
[pip3] torch-sparse==0.6.9
[pip3] torch-spline-conv==1.2.1
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchmetrics==0.2.0
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.1 h6406543_8 conda-forge
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38h1e0a361_2 conda-forge
[conda] mkl_fft 1.3.0 py38h54f3939_0
[conda] mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch 1.8.1 py3.8_cuda11.1_cudnn8.0.5_0 pytorch
[conda] pytorch-lightning 1.2.7 pyhd8ed1ab_0 conda-forge
[conda] pytorch3d 0.4.0 pypi_0 pypi
[conda] torch-cluster 1.5.9 pypi_0 pypi
[conda] torch-geometric 1.7.0 pypi_0 pypi
[conda] torch-scatter 2.0.6 pypi_0 pypi
[conda] torch-sparse 0.6.9 pypi_0 pypi
[conda] torch-spline-conv 1.2.1 pypi_0 pypi
[conda] torchaudio 0.8.1 py38 pytorch
[conda] torchmetrics 0.2.0 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.9.1 py38_cu111 pytorch
```

- Any other relevant information:

## Additional context

The values returned by torch.cdist also differ from those returned by torch.pdist.

cc @zasdfgbnm @ptrblck
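One plausible explanation for the small non-zero diagonals, consistent with the numerical-stability label on that issue, is floating-point cancellation when squared distances are computed through a matrix multiplication instead of by subtracting the vectors first. The sketch below is only an illustration of that effect in float32, not PyTorch's actual kernel; the variable names are made up for this example.

```python
import torch

torch.manual_seed(0)
a = torch.rand(64, 784)  # same shape as the batch in the question

# Expanded form: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x . y)
# The dot products come from a matmul, so for x == y two nearly equal
# numbers (each around a few hundred here) are subtracted from each other.
sq_norms = (a * a).sum(dim=1)
gram = a @ a.T
d2_expanded = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
# Rounding can make d2 slightly negative, so clamp before the square root.
dist_expanded = d2_expanded.clamp_min(0).sqrt()

# Direct form: subtract first, then square and sum.
# For x == y every difference is exactly 0, so the distance is exactly 0.
diff = a[:, None, :] - a[None, :, :]
dist_direct = diff.pow(2).sum(dim=-1).sqrt()

print(torch.diagonal(dist_expanded)[:10])  # may contain small non-zero values
print(torch.diagonal(dist_direct)[:10])    # all zeros
```

A rounding error of a few units in the last place of a squared norm near 260 is on the order of 1e-5, and its square root is on the order of 1e-3 to 1e-2, which is the same magnitude as the 0.0055 reported in the question.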

UPDATE: cdist() has an option, compute_mode='donot_use_mm_for_euclid_dist', that avoids using matmul when computing the Euclidean distances. With it, the diagonal distances come out as 0.
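A minimal sketch of that workaround, reusing the setup from the question (run on CPU here; the seed and variable names are only for illustration):

```python
import torch

torch.manual_seed(0)
a = torch.rand(64, 784)

# Default compute_mode: may use the matmul-based path for inputs of this size.
dist_default = torch.cdist(a, a, p=2)

# Force the direct computation, without matmul.
dist_direct = torch.cdist(
    a, a, p=2, compute_mode='donot_use_mm_for_euclid_dist'
)

diag_default = torch.diagonal(dist_default)
diag_direct = torch.diagonal(dist_direct)

print('default, max diagonal:', diag_default.max().item())
print('direct,  max diagonal:', diag_direct.max().item())
```

The direct mode is likely to be slower on large inputs, so it seems most useful when exact zeros on the diagonal matter. Another option, if the matmul path is preferred for speed, would be to zero the diagonal afterwards, for example with `dist_default.fill_diagonal_(0)`.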