Same code and same parameters, but different results

We ran the same code (model and training) with the same parameters on two GPUs (two identical GPUs in one server),
but the results differ. Why?
In the code, we set the random seed as follows:

import random

import torch

CUDA = torch.cuda.is_available()
if CUDA:
    # Ask cuDNN for deterministic kernels and disable its autotuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

How can we make the experiment reproducible?

The two TensorBoard figures for these runs show different results.

Not all CUDA ops are currently deterministic in PyTorch, as explained in the Reproducibility docs. Could you check whether you are using such operations in your model?
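In case it helps, here is a minimal sketch of a fuller seeding setup along the lines of the Reproducibility docs. The helper name `seed_everything` and the default seed are made up for illustration. It assumes PyTorch 1.8 or newer, where `torch.use_deterministic_algorithms` is available; depending on the version, an op without a deterministic implementation will either be switched to a deterministic one or raise a `RuntimeError`, which at least tells you which op is the culprit.

```python
import os
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed the common RNGs and request deterministic kernels."""
    random.seed(seed)                 # Python's built-in RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed_all(seed)  # silently ignored if CUDA is unavailable

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Some cuBLAS routines need this setting on CUDA >= 10.2 when deterministic
    # algorithms are requested; set it before any CUDA work is launched.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.use_deterministic_algorithms(True)


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

If you also use NumPy or DataLoader workers, those would need their own seeding; that is left out here to keep the sketch short.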

Thanks for the quick response.
There is a torch.Tensor.scatter_add_() operation in the model code.
Can this operation cause such a large difference?

Does this mean that if torch.Tensor.scatter_add_() is in the model, the results cannot be reproduced?
The model needs torch.Tensor.scatter_add_(). How can we solve this problem?

The difference from a single call is usually in the range of floating-point precision, but it can accumulate over time.
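As a small illustration of where such differences come from: floating-point addition is not associative, so summing the same values in a different order can change the last bits of the result. On the GPU, accumulating ops can add their contributions in an order that varies between runs. The snippet below is plain Python with arbitrarily chosen values and only shows the effect, not what scatter_add_ does internally.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but not zero
```

A discrepancy this small is harmless in one step, but in training it feeds into the next weight update, so two runs can slowly drift apart.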
If bitwise reproducibility is needed, you would have to work around scatter_add_ using some indexing or push this operation to the CPU.
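A rough sketch of the CPU option could look like the following. The helper name `scatter_add_on_cpu` is made up for this example, and it uses the out-of-place `scatter_add` so the input tensor is not modified. It assumes the CPU implementation gives the same result on every run, and it costs extra device-to-host copies, so it may slow training down noticeably.

```python
import torch


def scatter_add_on_cpu(dst, dim, index, src):
    """Hypothetical helper: run scatter_add on the CPU, then move the result back."""
    device = dst.device
    # Out-of-place scatter_add on CPU copies of the inputs
    out = dst.cpu().scatter_add(dim, index.cpu(), src.cpu())
    return out.to(device)


device = "cuda" if torch.cuda.is_available() else "cpu"
dst = torch.zeros(3, device=device)
index = torch.tensor([0, 0, 2, 1], device=device)
src = torch.tensor([1.0, 2.0, 3.0, 4.0], device=device)

result = scatter_add_on_cpu(dst, 0, index, src)
print(result)  # values 3., 4., 3. on the original device
```

Since `.cpu()` and `.to()` are differentiable, gradients should still flow back to `src` and `dst` through this helper.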