What are the best practices to ensure reproducibility across GPUs/devices?

I have a NN training problem that produces radically different losses/errors depending on the GPU/device being used, even though I set the default dtype to float64.

Of course I have read the PyTorch documentation and set all the seeds as recommended.
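For readers following along, here is a minimal sketch of what "setting all the seeds" usually involves. The helper name `seed_everything` is hypothetical, and the exact set of flags is an assumption about a typical setup, not the poster's actual code. Note that the PyTorch reproducibility notes say these settings aim at run-to-run determinism on one device and platform; identical results between CPU and GPU, or between different GPUs, are not guaranteed even with identical seeds.

```python
import os
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed the RNGs and request deterministic kernels."""
    # Needed by some cuBLAS operations when deterministic algorithms are
    # requested on CUDA; must be set before cuBLAS is first used.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    random.seed(seed)                  # Python's own RNG
    torch.manual_seed(seed)            # PyTorch RNG
    torch.cuda.manual_seed_all(seed)   # silently ignored if CUDA is unavailable

    torch.backends.cudnn.benchmark = False      # no autotuned kernel selection
    torch.backends.cudnn.deterministic = True   # deterministic cuDNN kernels
    torch.use_deterministic_algorithms(True)    # error on nondeterministic ops

    # Mirrors the float64 default mentioned in the question.
    torch.set_default_dtype(torch.float64)


seed_everything(123)
a = torch.rand(4)
seed_everything(123)
b = torch.rand(4)   # same seed, same device: same random numbers as `a`
```

If you also use NumPy, it has its own RNG that would need seeding separately.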

But I read in this thread that one should use x = x / 2 instead of x /= 2, which apparently solved that user’s problem of getting different results.

This isn’t documented anywhere, so I’m wondering what else I might be using that I should avoid if I want very similar results on different GPUs and devices. Does anyone know of other tricks or things I need to look out for?

In-place operations do not cause issues by themselves. If you read through the linked post, you will see that the in-place division modified the user’s input data on the CPU but not on the GPU, because the user created copies of the inputs when the GPU was used.
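The mechanism can be sketched on the CPU alone. This is a plausible reconstruction with made-up tensor names, not the code from the linked thread. It relies on the documented behavior that `Tensor.to` returns the tensor itself when the dtype and device already match, whereas moving to a GPU always produces a copy.

```python
import torch

# Case 1: the data already lives on the CPU, so .to("cpu") makes no copy.
x = torch.tensor([2.0, 4.0])
same = x.to("cpu")      # returns x itself, not a copy
same /= 2               # in-place: the original x is modified as well

# Case 2: out-of-place division allocates a new tensor.
y = torch.tensor([2.0, 4.0])
alias = y.to("cpu")     # still the same object as y
halved = alias / 2      # new tensor; y keeps its original values

# With device="cuda", x.to(device) would be a fresh copy, so the in-place
# division would leave the CPU original untouched. That asymmetry, not the
# in-place operation itself, produces the CPU/GPU difference.
```

Calling `.clone()` before an in-place operation is another way to avoid mutating the caller's data, whichever device is in use.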

Could the user have avoided the problem by writing X_test = X_test.to(device) directly after defining it? I’m trying to understand the cause of the issue.

Edit: yes, it looks like this solves the issue in the sense that it produces consistent output, but the result differs from the one the OP’s solution gives.
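One possible explanation for that difference, sketched under assumptions: once X_test is rebound to the device tensor, an in-place division mutates the same data on every device, so CPU and GPU agree, but repeated calls keep shrinking the input. The function `scale_inplace` is hypothetical and only stands in for whatever preprocessing the linked code performed.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

X_test = torch.tensor([2.0, 4.0], dtype=torch.float64)
X_test = X_test.to(device)   # rebind once; all later code sees this tensor


def scale_inplace(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical preprocessing step that divides its input in place."""
    x /= 2
    return x


# Each call mutates X_test itself, identically on CPU and GPU.
first = scale_inplace(X_test).clone()    # X_test halved once
second = scale_inplace(X_test).clone()   # X_test halved a second time
```

With x = x / 2 inside the function instead, X_test would stay unchanged and both calls would return the same values, which would account for an answer that is consistent across devices yet different from the in-place version.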

But I am still a bit concerned that there are other use cases that cause these differences between GPU and CPU and that haven’t been documented very well.