Computation precision differs a lot between GPU mode and CPU mode

After many operations, I get the results below (I am trying to reproduce the results). What I found is that, in terms of computation precision, FMA > parallel > serial.
Does this mean that GPU mode is more precise than CPU mode? Judging from my results, though, CPU mode seems to be more accurate. What can I do to narrow the gap between GPU mode and CPU mode? In addition, what can I do to reduce precision loss?

# cpu mode, device = torch.device("cpu")
-104.9049,   -2.0102,  -56.8038,
# gpu mode, device = torch.device("cuda")
-109.6338,  -16.2780,   44.1723,
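
To illustrate the "parallel vs. serial" point, here is a minimal sketch of how the order of additions alone can change a float32 result. It uses only the Python standard library and is not the code that produced the numbers above. The helpers `to_f32`, `serial_sum_f32`, and `pairwise_sum_f32` are hypothetical names made up for this example, and the tree-shaped sum only loosely stands in for a parallel reduction.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def serial_sum_f32(values):
    """Add left to right, rounding to float32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + v)
    return acc


def pairwise_sum_f32(values):
    """Add in a tree shape, rounding to float32 at every node."""
    if len(values) == 1:
        return to_f32(values[0])
    mid = len(values) // 2
    return to_f32(pairwise_sum_f32(values[:mid]) + pairwise_sum_f32(values[mid:]))


# 2**24 is where float32 stops representing every integer exactly.
values = [16777216.0, 1.0, 1.0]

serial = serial_sum_f32(values)      # (2**24 + 1) + 1, each step rounded
pairwise = pairwise_sum_f32(values)  # 2**24 + (1 + 1)
exact = math.fsum(values)            # correctly rounded float64 sum

print(serial)    # 16777216.0
print(pairwise)  # 16777218.0
print(exact)     # 16777218.0
```

The same three inputs give two different float32 answers purely because of the order of the additions. Rerunning the real computation in float64 on both devices (for example with `tensor.double()`) is one way to check whether the CPU/GPU gap comes from rounding of this kind.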