Column-wise dot product: torch.einsum not matching torch.sum(torch.mul(), axis=0)

I am trying to perform a dot product between the columns of two tensors, and I want to do this as efficiently as possible. However, my two methods do not match up.

My first method, torch.sum(torch.mul(a, b), axis=0), gives me my expected results; torch.einsum('ji, ji -> i', a, b) (taken from Efficient method to compute the row-wise dot product of two square matrices of the same size in PyTorch - Stack Overflow) does not. Reproducible code is below:

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# a has a single column, which is broadcast across the 4 columns of b
a = torch.randn(3, 1, dtype=torch.float).to(device)
b = torch.randn(3, 4, dtype=torch.float).to(device)

print(f"a : \n{a}\n")
print(f"b : \n{b}\n")

# dot product of a's column with b's first column, written out by hand
print(f"Expected:    {a[0,0]*b[0,0] + a[1,0]*b[1,0] + a[2,0]*b[2,0]}")

# method 1: elementwise multiply, then sum over the rows
c = torch.sum(torch.mul(a, b), axis=0)
print(f"sum and mul: {c[0].item()}")

# method 2: einsum contraction over the row index j
d = torch.einsum('ji, ji -> i', a, b)
print(f"einsum:      {d[0].item()}\n")

print(torch.eq(c, d))

The output is:
[screenshot of the program output: the sum/mul and einsum values differ in their last few digits, and torch.eq(c, d) is not all True]

On the CPU (all I did was remove the .to(device) calls), the last line, torch.eq(c, d), is all True. However, I need the tensors to be on the GPU.
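For comparison, a tolerance-based check is the usual way to compare float32 results; a minimal sketch, reusing c and d from the code above:

print(torch.eq(c, d))        # exact equality: can fail on the GPU
print(torch.allclose(c, d))  # should be True within the default rtol/atol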

Also, for some seeds, such as torch.manual_seed(100), the tensors are equal…

I feel like it has to be something with einsum, because I can get my expected answer in other ways (a couple of sketches below).
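For instance, these all give me the same expected values, reusing a and b from above (the torch.mm version is only equivalent here because a has a single column):

c1 = torch.sum(torch.mul(a, b), axis=0)   # broadcast multiply, then sum rows
c2 = (a * b).sum(dim=0)                   # same thing, terser
c3 = torch.mm(a.t(), b).flatten()         # (1, 3) @ (3, 4) -> (1, 4) -> (4,)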

I think these small discrepancies are to be expected, given that float32 only has 7-8 significant digits of precision anyway. You should check that the difference is much smaller with float64.
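Something along these lines shows how the disagreement scales with precision (a minimal sketch, with the same shapes as in your code):

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for dtype in (torch.float32, torch.float64):
    a = torch.randn(3, 1, dtype=dtype, device=device)
    b = torch.randn(3, 4, dtype=dtype, device=device)
    c = torch.sum(torch.mul(a, b), axis=0)
    d = torch.einsum('ji, ji -> i', a, b)
    # worst-case disagreement between the two methods
    print(dtype, (c - d).abs().max().item())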

Yes, using dtype=torch.float64 made the difference smaller, but do you happen to know why there is a discrepancy at all between torch.sum(torch.mul(a, b), axis=0) and torch.einsum('ji, ji -> i', a, b)?

torch.einsum computes the same quantity using a different sequence of operations, in this case a reshape/view followed by a batched matrix multiply. Since floating-point addition is not associative, accumulating the products in a different order can change the last few bits of the result.
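A minimal sketch of both points; the bmm formulation below illustrates the kind of rewrite involved, not necessarily the exact kernel einsum dispatches to:

import torch

# float32 addition is not associative: grouping changes the result
x = torch.tensor([1.0, 1e8, -1e8], dtype=torch.float32)
print((x[0] + x[1]) + x[2])   # tensor(0.) -- the 1.0 is absorbed by 1e8
print(x[0] + (x[1] + x[2]))   # tensor(1.)

# phrasing 'ji, ji -> i' as a batched matmul: each column pair
# becomes a (1, 3) @ (3, 1) product
a = torch.randn(3, 4)
b = torch.randn(3, 4)
d = torch.einsum('ji, ji -> i', a, b)
e = torch.bmm(a.t().unsqueeze(1), b.t().unsqueeze(2)).flatten()
print(torch.allclose(d, e))   # same values, up to float rounding

Because each formulation accumulates the products in a different order, the float32 results can differ in the last bits, which is exactly what you are seeing.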