Same code and same parameters, but different results

I run the same code (model and training) with the same parameters on two GPUs (two identical GPUs on one server), but I get different results. Why?
In the code, we set the random seeds as follows:

import random
import numpy as np
import torch

CUDA = torch.cuda.is_available()

# Seed the Python, NumPy, and PyTorch RNGs.
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)
if CUDA:
    # Seed the CUDA RNG on the current device and on all devices.
    torch.cuda.manual_seed(123)
    torch.cuda.manual_seed_all(123)

# Ask cuDNN for deterministic algorithms and disable the benchmark autotuner,
# which can otherwise pick a different kernel from run to run.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

How can I reproduce the experiment exactly?

The two TensorBoard figures below show the diverging results.

Not all CUDA ops are currently deterministic in PyTorch, as explained in the Reproducibility docs. Could you check whether you are using such operations in your model?
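If your PyTorch version is recent enough (1.8 or newer), you can also let PyTorch flag such ops for you. A minimal sketch; model and batch are placeholders for your own objects:

import torch

# With this flag set, ops that have no deterministic CUDA implementation raise
# a RuntimeError at the call site instead of silently drifting between runs.
torch.use_deterministic_algorithms(True)

# Some CUDA versions additionally require an environment variable for
# deterministic cuBLAS matmuls (see the Reproducibility docs):
#   export CUBLAS_WORKSPACE_CONFIG=:4096:8

# Run one forward/backward pass of your model; any offending op will error out.
# output = model(batch)        # `model` and `batch` are placeholders
# output.sum().backward()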

Thanks for the quick response.
There is a torch.Tensor.scatter_add_() operation in the model code.
Could this operation cause such a large difference?

Does this mean that if torch.Tensor.scatter_add_() is used in the model, the results cannot be reproduced?
The model needs torch.Tensor.scatter_add_(), so how can this problem be solved?

The difference is usually in the range of floating point precision, which can accumulate over time.
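You can often see the effect in isolation with a small check like the one below (a sketch assuming a CUDA device is available; whether the two results actually differ bitwise depends on the GPU and PyTorch version):

import torch

torch.manual_seed(123)
device = "cuda"

# Sum one million values into only ten slots, so many atomic adds collide.
src = torch.randn(1_000_000, device=device)
index = torch.randint(0, 10, (1_000_000,), device=device)

out1 = torch.zeros(10, device=device).scatter_add_(0, index, src)
out2 = torch.zeros(10, device=device).scatter_add_(0, index, src)

print(torch.equal(out1, out2))      # may be False: the summation order is not fixed
print((out1 - out2).abs().max())    # any difference is around float32 precision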
If bitwise accuracy is needed, you would have to work around scatter_add_ using some indexing scheme or push this operation to the CPU.
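A minimal sketch of the CPU workaround could look like this (deterministic_scatter_add is a hypothetical helper name; the .cpu()/.to() transfers keep autograd working but add a device round trip per call):

import torch

def deterministic_scatter_add(base, dim, index, src):
    # Run the reduction on the CPU, where the summation order is fixed,
    # then move the result back to the original device.
    device = base.device
    out = base.cpu().scatter_add(dim, index.cpu(), src.cpu())
    return out.to(device)

# Usage sketch with made-up shapes:
# pooled = deterministic_scatter_add(torch.zeros(10, device="cuda"), 0, index, src)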