I have 8 tasks running on 8 GPUs, potentially concurrently, through a Jupyter notebook. I would like each task to be deterministic, so I call torch.cuda.manual_seed_all(seed) at the beginning of each task. However, the tasks are not deterministic, and I believe this is because the GPUs are potentially being re-seeded at runtime when another task is launched. Am I correct in this assumption? If so, how do I fix this behavior? If not, what’s the issue? Thanks so much!
Based on the Reproducibility docs, you would have to disable cudnn.benchmark and set cudnn.deterministic=True, besides seeding.
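A minimal sketch of how this could look at the start of each task (the seed value is just a placeholder):

```python
import torch

seed = 42  # placeholder; use whatever seed each task needs

# Seed the CPU RNG and the RNG on every visible GPU
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Disable the cuDNN autotuner (it can pick different kernels from run
# to run) and request deterministic cuDNN algorithms instead
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```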
Also, note that some CUDA operations are non-deterministic:
There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. PyTorch functions that use atomicAdd in the forward include torch.Tensor.index_add_(), torch.Tensor.scatter_add_(), torch.bincount().

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.
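If you want to see this effect in isolation, here is a small sketch (assuming a CUDA device is available) that runs torch.Tensor.index_add_() several times on identical inputs; any spread in the results comes from the undetermined order of the parallel atomicAdd calls, not from seeding:

```python
import torch

torch.manual_seed(0)
src = torch.randn(100_000, device="cuda")
index = torch.randint(0, 10, (100_000,), device="cuda")

results = []
for _ in range(5):
    out = torch.zeros(10, device="cuda")
    out.index_add_(0, index, src)  # uses atomicAdd under the hood
    results.append(out)

# A non-zero std shows run-to-run variance despite identical inputs;
# depending on your hardware and kernel, the differences may also be zero.
print(torch.stack(results).std(dim=0))
```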
I’m particularly concerned about accidentally re-seeding all 8 GPUs while another task is in the process of running, thus introducing non-deterministic behavior due to the seed reset. Is this possible?
The seed state should be local to the current application, i.e. the process, so a task launched in another process shouldn’t be able to re-seed your GPUs.
Did you see any weird behavior?
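That said, if your tasks do share a single process (e.g. one Jupyter kernel), one option would be to seed only the device each task actually uses instead of calling torch.cuda.manual_seed_all(). A sketch, where seed_task, gpu_id, and seed are hypothetical names for illustration:

```python
import torch

def seed_task(gpu_id: int, seed: int) -> None:
    """Seed only the RNGs this task touches (hypothetical helper)."""
    torch.manual_seed(seed)            # CPU RNG (process-wide)
    with torch.cuda.device(gpu_id):
        torch.cuda.manual_seed(seed)   # RNG of this GPU only
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

seed_task(gpu_id=0, seed=42)
```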