PyTorch reproducibility

Ferd · November 4, 2021, 12:38pm

I have a problem with reproducibility in PyTorch.
I run a network multiple times but obtain a different result each time.
Here are my settings:

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

I expect the output of the network to be the same every time I run it, at least for the first few mini-batches.
But this is not the case. Here are the experiment results (I choose the same GPU and seed each time, batch size = 1):

With random initialization, the loss of batch 1 is the same each time, which indicates that the initial weights are the same. However, with Adam, the results differ from batch 2 onward.

When resuming from a checkpoint (the optimizer is also restored), it produces the expected result.
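A comparison like the one above can be automated by logging the per-batch loss of each run and looking for the first batch where two runs disagree. The following is only a minimal sketch: the helper name `first_divergence` and the loss values are hypothetical and are not taken from the actual experiment.

```python
import math

def first_divergence(losses_a, losses_b, rel_tol=0.0):
    """Return the 1-based batch index where two loss traces first differ.

    With rel_tol=0.0 the comparison is exact (bitwise-equal floats count
    as equal). Returns None if the traces agree on every shared batch.
    """
    for batch, (a, b) in enumerate(zip(losses_a, losses_b), start=1):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            return batch
    return None

# Hypothetical loss traces from two runs with the same seed.
run_a = [2.3026, 2.2871, 2.2540]
run_b = [2.3026, 2.2869, 2.2640]

print("exact comparison:", first_divergence(run_a, run_b))
print("0.1% tolerance:  ", first_divergence(run_a, run_b, rel_tol=1e-3))
```

An exact comparison shows where bitwise reproducibility is lost, while a small tolerance shows whether the difference is large enough to matter.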

I am confused. I can't find any random variable in Adam, and I don't know how to explain the result. Is there anything I missed when initializing Adam?

Any help is appreciated.

Jim_Thompson · November 5, 2021, 3:22am

I noticed you did not mention this setting: torch.use_deterministic_algorithms (see the PyTorch 1.10.0 documentation).

Ferd · November 5, 2021, 6:07am

Thank you for your reply. However, it was introduced in torch 1.8.0, and I use 1.2.0.

I also use torch.nn.functional.interpolate() in the network, so it would definitely throw a RuntimeError if I set torch.use_deterministic_algorithms to True.

But I don't think that is the reason, because I get the same result every time when I use SGD or initialize the network from a certain checkpoint.
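Since the flag only exists in newer releases, code that has to run on both old and new versions could guard the call. This is a rough sketch, not an official recipe: the helper name `configure_determinism` is hypothetical, and the fallback simply reuses the cuDNN settings from the first post.

```python
import torch

def configure_determinism():
    """Enable the strictest determinism settings this torch version offers.

    Returns True if torch.use_deterministic_algorithms was available and
    enabled, False if only the cuDNN flags could be set.
    """
    # These flags are the ones already used in the original settings.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    # Older releases do not have this function, so check before calling.
    if hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(True)
        return True
    return False

strict = configure_determinism()
print("use_deterministic_algorithms enabled:", strict)
```

With the strict mode on, an operation that has no deterministic implementation raises a RuntimeError when it is called, so the guard does not help with the interpolate() issue mentioned above.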

tom · November 5, 2021, 7:36am

So the errors from the nondeterminism in the PyTorch interpolate function are very small (they mostly come from numerical precision when floating point numbers are added in different orders). Adam has the somewhat unfortunate property that its first few steps have funny scaling, which can amplify random fluctuations. So you might add an Adam warmup phase with a very low learning rate.
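To make this concrete, here is a small pure-Python sketch of one way the amplification can happen. It assumes the standard Adam update rule with its usual default hyperparameters and zero-initialized moments, and it treats a single scalar parameter. The gradient, noise, and learning-rate values are hypothetical and chosen only for illustration.

```python
import math

# Floating-point addition is not associative, so adding the same numbers
# in a different order can change the last bits of the result.
order_gap = abs(((0.1 + 0.2) + 0.3) - (0.1 + (0.2 + 0.3)))

def sgd_step(grad, lr):
    """Plain SGD update for one scalar parameter."""
    return -lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update for one scalar parameter, moments starting at zero."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)  # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return -lr * m_hat / (math.sqrt(v_hat) + eps)

grad = 1e-6   # hypothetical small gradient component
noise = 1e-9  # hypothetical run-to-run difference in that gradient
lr = 1e-3     # hypothetical learning rate

# How far apart do two runs end up after one update?
sgd_gap = abs(sgd_step(grad + noise, lr) - sgd_step(grad, lr))
adam_gap = abs(adam_first_step(grad + noise, lr) - adam_first_step(grad, lr))

# Same comparison with a warmup learning rate 100 times smaller.
warm_lr = lr * 0.01
warm_gap = abs(adam_first_step(grad + noise, warm_lr) - adam_first_step(grad, warm_lr))

print("summation-order gap:", order_gap)
print("SGD gap:            ", sgd_gap)
print("Adam gap:           ", adam_gap)
print("Adam gap with warmup:", warm_gap)
```

In this toy setting SGD scales the difference by the learning rate, whereas the first Adam step divides by the gradient magnitude and so turns the same tiny difference into a much larger gap between the parameters. Lowering the learning rate during warmup shrinks that gap proportionally.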

Note that in general I would consider your results stable enough to embrace the fact that there is stochastic fluctuation. The rather weak reproducibility that PyTorch can give you is no replacement for stable procedures, as other people with different hardware, a different OS, or different tooling would want to be able to reproduce your results, too.

Best regards

Thomas