After many operations, the results are shown below (I am trying to reproduce them). From what I have read, the ordering in terms of computation precision is FMA > parallel > serial.
Does this mean that GPU mode is more precise than CPU mode? Judging from my results, though, CPU mode appears to be the more accurate one. What can I do to narrow the gap between GPU mode and CPU mode? In addition, what can I do to reduce the loss of precision?
# cpu mode, device = torch.device("cpu")
-104.9049, -2.0102, -56.8038,
# gpu mode, device = torch.device("cuda")
-109.6338, -16.2780, 44.1723,
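To make the "parallel vs. serial" part of my question concrete, here is a minimal sketch in plain Python (no PyTorch, float64 only) of how the order of additions alone can change a floating-point result. The helper names `serial_sum` and `pairwise_sum` are my own, purely for illustration; the tree-style reduction is only a rough stand-in for what a parallel reduction might do, not what the CPU or CUDA kernels actually run.

```python
import math


def serial_sum(values):
    """Accumulate left to right, like a single-threaded loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Tree-style reduction: split in half, sum each half, add the halves.

    Illustrative stand-in for a parallel reduction (an assumption,
    not the actual kernel used by any backend).
    """
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


data = [0.1, 0.2, 0.3]

# math.fsum tracks the rounding error and serves as a reference value.
reference = math.fsum(data)

print("serial  :", repr(serial_sum(data)))
print("pairwise:", repr(pairwise_sum(data)))
print("fsum    :", repr(reference))
print("orders agree:", serial_sum(data) == pairwise_sum(data))
```

Both functions add exactly the same numbers, yet the grouping differs, so the rounding differs. If small differences like this are amplified over many operations, that might explain part of the gap I see, although it would not tell me which device is the "more correct" one.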