Deterministic training when using mixed-precision

TL;DR: After switching to torch.cuda.amp, my training became deterministic, even though the torch.backends.cudnn options are left at deterministic=False, benchmark=False, etc.

Environment

  • 2080Ti (CUDA 11.2, Driver 460.91.03)
  • PyTorch 1.11.0.dev20211127
  • Python 3.9.7

I experimented with this minimal MNIST example and reproduced the nondeterminism across training runs (i.e., different epoch losses when I train from scratch multiple times). The source of the nondeterminism must be GPU operations, since the random seeds are fixed.

If I set torch.backends.cudnn.deterministic=True, I get deterministic training, like the original author reported.
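
For context, this is roughly the setup I'm describing (a minimal sketch; the exact seeding calls and the model are in the linked example):

```python
import random

import numpy as np
import torch

# Fix all RNG seeds so that any remaining run-to-run variation comes from
# GPU kernels rather than from weight init or data shuffling.
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Default (non-deterministic) setup: cuDNN may pick any algorithm.
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = False

# Flipping this flag is what makes the plain FP32 runs deterministic:
# torch.backends.cudnn.deterministic = True
```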

However, if I use torch.cuda.amp for mixed-precision training, I also get deterministic results, even without setting deterministic=True.
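
The amp version just wraps the forward pass in autocast and scales the loss. Here's a minimal sketch of that loop; the model, optimizer, and data below are toy stand-ins, the real ones are in the writeup:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

# Toy stand-ins so the loop runs end to end; the real model/data are in the writeup.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3), nn.ReLU(),
    nn.Conv2d(8, 8, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 24 * 24, 10),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()

for _ in range(10):
    inputs = torch.randn(32, 1, 28, 28, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with autocast():                           # forward pass in mixed precision
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()              # scaled backward to avoid FP16 gradient underflow
    scaler.step(optimizer)                     # unscales grads, then calls optimizer.step()
    scaler.update()
```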

Has anyone seen similar cases, or does anyone have insight into how using amp could remove the nondeterminism?

Here’s a writeup with further details on experiments and code.

Thanks!

Non-deterministic results could be caused by the actual kernel selection, and amp could happen to be using, e.g., deterministic cuDNN kernels.

Thanks for the quick response! How could I check whether the amp code is using deterministic cuDNN kernels? I know the non-determinism before using amp came from Conv2d layers preceding another Conv2d layer during backprop, if that narrows it down.

You could profile the kernels and see if you can determine the underlying algorithm from the kernel name.
Generally, this can be hard to guess unless you are familiar with the different algorithms and how they might be implemented internally. The general rule is that kernels can be deterministic in the default setup, but have to be deterministic if the corresponding flags (e.g. torch.backends.cudnn.deterministic=True) are set.
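
Something like this would show the kernel names for a single forward/backward pass (toy model and input; run it once with amp and once without and compare the conv kernels):

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

# Toy stand-ins; in practice, profile one step of your real training loop.
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3)).cuda()
inputs = torch.randn(32, 1, 28, 28, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    with torch.cuda.amp.autocast():        # drop this context for the FP32 comparison run
        out = model(inputs)
    out.sum().backward()

# Kernel names often hint at the underlying cuDNN algorithm
# (e.g. "winograd", "implicit_gemm", "wgrad"), which you can then
# compare between the FP32 and amp runs.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```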