I am trying to perform a dot product between the columns of two tensors, as efficiently as possible. However, my two methods do not match up.

My first method, `torch.sum(torch.mul(a, b), axis=0)`, gives my expected results, but `torch.einsum('ji, ji -> i', a, b)` (taken from Efficient method to compute the row-wise dot product of two square matrices of the same size in PyTorch - Stack Overflow) does not. The reproducible code is below:

```
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(0)
a = torch.randn(3,1, dtype=torch.float).to(device)
b = torch.randn(3,4, dtype=torch.float).to(device)
print(f"a : \n{a}\n")
print(f"b : \n{b}\n")
print(f"Expected: {a[0,0]*b[0,0] + a[1,0]*b[1,0] + a[2,0]*b[2,0]}")
c = torch.sum(torch.mul(a, b), axis=0)
print(f"sum and mul: {c[0].item()}")
d = torch.einsum('ji, ji -> i', a, b)
print(f"einsum: {d[0].item()}\n")
print(torch.eq(c,d))
```

Notes:

On the CPU (all I did was remove the `.to(device)`), the last line `torch.eq(c, d)` is all `True`; however, I need the tensors to be on the GPU.

Also, for some seeds, such as `torch.manual_seed(100)`, the tensors are equal…

I feel like it has to be something with einsum, because I can get my expected answer in other ways.
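For what it's worth, since the two methods can reduce over the `j` dimension in different orders, exact comparison with `torch.eq` may be too strict. A tolerance-aware comparison such as `torch.allclose` is a common alternative; a minimal sketch on the CPU (where both methods agreed above):

```python
import torch

torch.manual_seed(0)
a = torch.randn(3, 1, dtype=torch.float)
b = torch.randn(3, 4, dtype=torch.float)

# Column-wise dot products via broadcasting: mul gives (3, 4), sum over dim 0 gives (4,)
c = torch.sum(torch.mul(a, b), dim=0)

# Same contraction via einsum; the size-1 'i' dimension of `a` broadcasts against b's
d = torch.einsum('ji, ji -> i', a, b)

# Elementwise equality vs. comparison within floating-point tolerance
print(torch.eq(c, d))
print(torch.allclose(c, d))
```

This does not explain the GPU discrepancy itself, but it distinguishes a genuine numerical bug from results that merely differ in the last few bits.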