Thanks, you are right about float64. The number of differing digits is similar (it depends on the experiment), but the values are much closer.
import numpy as np
import torch
# identical float64 data on CPU and GPU
a = torch.from_numpy(np.random.rand(5000, 100000).astype(np.float64))
b = torch.from_numpy(np.random.rand(5000, 100000).astype(np.float64))
c = a.cuda()
d = b.cuda()
# torch.dot expects 1D tensors, so compare the dot product of the flattened arrays
print(a.flatten().dot(b.flatten()))
print(c.flatten().dot(d.flatten()))
125000868.65247717
125000868.65247723
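For reference, a minimal sketch (not part of the original snippet) that quantifies how close the two printed values are; cpu_result and gpu_result are just the two numbers above:

cpu_result = 125000868.65247717
gpu_result = 125000868.65247723
rel_err = abs(cpu_result - gpu_result) / abs(cpu_result)
print(rel_err)  # ~4.8e-16, only a few float64 ULPs (machine epsilon is ~2.2e-16)

A gap of that size is what you would expect when the two backends simply accumulate the sum of products in a different order.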