Why do the GPU and CPU lead to different results?

It seems that the GPU and CPU produce the same forward output for a given neural network. But once training starts and the optimizer begins updating the weights, the results are totally different, and the training losses are very different too. Has anyone run into the same problem before?
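
One possible source I can think of (an assumption, not something I have confirmed in my case) is that floating-point addition is not associative. If the GPU sums the same terms in a different order than the CPU, the last bits of a gradient could differ, and that difference could compound over many update steps. A minimal plain-Python sketch of the effect, with hypothetical helper names and no actual GPU involved:

```python
import math


def sum_left_to_right(values):
    """Add the values one after another, as a simple sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Add the values as a tree, similar in spirit to a parallel reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


values = [0.1, 0.2, 0.3]
sequential = sum_left_to_right(values)  # (0.1 + 0.2) + 0.3
pairwise = sum_pairwise(values)         # 0.1 + (0.2 + 0.3)

print(sequential == pairwise)              # the two orders disagree in the last bit
print(math.isclose(sequential, pairwise))  # but they agree within a tolerance
```

If this is what is happening, I would expect the losses to drift apart gradually rather than be totally different from the first step, so it may not fully explain what I am seeing.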