# A discrepancy between torch.cdist and manually unsqueezing tensors

I run two pieces of code that should compute the same thing, the negative L1 distance $-\sum_d |x_d - w_d|$:

```python
import torch

def via_unsqueeze(W_col, X_col):
    # (M, 1, D) - (1, N, D) -> (M, N, D); sum |.| over the feature dim
    output = -(W_col.unsqueeze(1) - X_col.unsqueeze(0)).abs().sum(2)
    return output
```


and

```python
def via_cdist(W, X):
    output = -torch.cdist(W, X, p=1)
    return output
```


Sometimes there is a difference on the order of 1e-5 when X and W are filled with random values.

I checked this with a test:

```python
def a():
    W = torch.rand(1000, 300)
    X = torch.rand(1000, 300)
    return (via_unsqueeze(W, X) - via_cdist(W, X)).abs()
```


The output looks like:

```
tensor([[0.0000e+00, 3.8147e-05, 2.2888e-05,  ..., 2.2888e-05, 3.8147e-05,
         1.5259e-05],
        [3.0518e-05, 1.5259e-05, 0.0000e+00,  ..., 0.0000e+00, 3.0518e-05,
         2.2888e-05],
        [3.0518e-05, 6.8665e-05, 1.5259e-05,  ..., 7.6294e-06, 1.5259e-05,
         7.6294e-06],
        ...,
        [7.6294e-06, 1.5259e-05, 1.5259e-05,  ..., 3.0518e-05, 1.5259e-05,
         3.8147e-05],
        [3.0518e-05, 7.6294e-06, 2.2888e-05,  ..., 4.5776e-05, 7.6294e-06,
         7.6294e-06],
        [3.0518e-05, 7.6294e-06, 3.8147e-05,  ..., 1.5259e-05, 3.8147e-05,
         3.0518e-05]])
```
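For reference, the mismatch can be summarized as a single number. This is a sketch, assuming both inputs have shape `(rows, features)` as in the test above (the manual version broadcasts `(M, 1, D)` against `(1, N, D)`), with smaller sizes to keep the intermediate tensor cheap:

```python
import torch

def via_unsqueeze(W, X):
    # Pairwise negative L1 distances via broadcasting: (M, 1, D) - (1, N, D)
    return -(W.unsqueeze(1) - X.unsqueeze(0)).abs().sum(2)

def via_cdist(W, X):
    return -torch.cdist(W, X, p=1)

torch.manual_seed(0)
W = torch.rand(200, 300)
X = torch.rand(200, 300)
a = via_unsqueeze(W, X)
b = via_cdist(W, X)
print((a - b).abs().max())               # on the order of 1e-5 in float32
print(torch.allclose(a, b, atol=1e-4))   # the two results agree within tolerance
```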


So I am wondering how this happens and how I can fix it.

Thanks in advance.

The difference is most likely caused by the limited floating-point precision and a different order of operations. You could use torch.float64 as the data type if you need more precision, but note that this data type causes a performance hit on GPUs.
A simple example is given here:

```python
x = torch.randn(100, 100)
sum1 = x.sum()
sum2 = x.sum(0).sum(0)
print((sum1 - sum2).abs().max())
# > tensor(2.2888e-05)

x = torch.randn(100, 100, dtype=torch.float64)
sum1 = x.sum()
sum2 = x.sum(0).sum(0)
print((sum1 - sum2).abs().max())
# > tensor(2.1316e-14, dtype=torch.float64)
```
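Applying the same idea to the comparison from the question shows the effect directly. This is a sketch, assuming both inputs are `(rows, features)` and using reduced sizes to keep memory small; the manual version broadcasts `(M, 1, D)` against `(1, N, D)`:

```python
import torch

def via_unsqueeze(W, X):
    # Pairwise negative L1 distances via broadcasting
    return -(W.unsqueeze(1) - X.unsqueeze(0)).abs().sum(2)

def via_cdist(W, X):
    return -torch.cdist(W, X, p=1)

# Same comparison as before, but in float64
W = torch.rand(200, 300, dtype=torch.float64)
X = torch.rand(200, 300, dtype=torch.float64)
diff = (via_unsqueeze(W, X) - via_cdist(W, X)).abs().max()
print(diff)  # orders of magnitude smaller than the float32 discrepancy
```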


Thank you for the answer.