Unable to obtain deterministic training

Hi all,

I was using NVIDIA/pix2pixHD and trying to make the training deterministic. Here is what I set at the beginning of the main training entry script:

import os
import random

import numpy as np
import torch

random.seed(123)
torch.manual_seed(123)
np.random.seed(123)
torch.cuda.manual_seed_all(123)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ['PYTHONHASHSEED'] = str(123)

But the training was not deterministic, and the discrepancies were too large to be explained by floating-point error.

Here’s the output from one run:

(epoch: 1, iters: 1, time: 4.743) G_GAN: 4.459 G_GAN_Feat: 12.911 G_VGG: 10.349 D_real: 4.135 D_fake: 2.747 
(epoch: 1, iters: 2, time: 0.698) G_GAN: 21.762 G_GAN_Feat: 12.063 G_VGG: 8.263 D_real: 26.311 D_fake: 21.320 
(epoch: 1, iters: 3, time: 0.696) G_GAN: 7.435 G_GAN_Feat: 12.372 G_VGG: 8.679 D_real: 8.118 D_fake: 7.287 
(epoch: 1, iters: 4, time: 0.693) G_GAN: 7.552 G_GAN_Feat: 11.537 G_VGG: 7.424 D_real: 6.559 D_fake: 7.466 
(epoch: 1, iters: 5, time: 0.697) G_GAN: 7.928 G_GAN_Feat: 10.878 G_VGG: 8.100 D_real: 6.153 D_fake: 7.848 
(epoch: 1, iters: 6, time: 0.697) G_GAN: 4.726 G_GAN_Feat: 10.149 G_VGG: 8.180 D_real: 3.945 D_fake: 4.801

Here’s another run:

(epoch: 1, iters: 1, time: 4.797) G_GAN: 4.459 G_GAN_Feat: 12.911 G_VGG: 10.349 D_real: 4.135 D_fake: 2.747 
(epoch: 1, iters: 2, time: 0.703) G_GAN: 21.762 G_GAN_Feat: 12.063 G_VGG: 8.263 D_real: 26.311 D_fake: 21.320 
(epoch: 1, iters: 3, time: 0.702) G_GAN: 7.434 G_GAN_Feat: 12.372 G_VGG: 8.679 D_real: 8.118 D_fake: 7.287 
(epoch: 1, iters: 4, time: 0.701) G_GAN: 7.590 G_GAN_Feat: 11.538 G_VGG: 7.420 D_real: 6.560 D_fake: 7.503 
(epoch: 1, iters: 5, time: 0.702) G_GAN: 7.839 G_GAN_Feat: 10.882 G_VGG: 8.096 D_real: 6.145 D_fake: 7.759 
(epoch: 1, iters: 6, time: 0.702) G_GAN: 4.819 G_GAN_Feat: 10.133 G_VGG: 8.170 D_real: 3.960 D_fake: 4.899

Note that the above two runs were obtained with num_workers=0 when creating the data loaders. With num_workers>0, the randomness persists.
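
For the num_workers > 0 case I would expect to also need per-worker seeding along these lines (a minimal sketch with a placeholder TensorDataset and my own seed_worker helper, not pix2pixHD's actual data loader), though the GPU non-determinism remains either way:

    import random

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def seed_worker(worker_id):
        # Derive each worker's seed from the base seed PyTorch assigns it,
        # so the NumPy and Python RNGs inside the worker are seeded as well.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    dataset = TensorDataset(torch.arange(16, dtype=torch.float32))  # placeholder data
    loader = DataLoader(dataset, batch_size=4, shuffle=True,
                        num_workers=2, worker_init_fn=seed_worker)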

The original implementation has a bug when running on CPU, so I tried my own fork (which has some modifications compared to the original), and there training on CPU is fully deterministic. However, my fork also cannot train deterministically on GPU. If it matters, I am using a single 2080 Ti with CUDA 10.0 for these runs.

Honestly, I am not sure whether this is a pix2pixHD issue or a PyTorch one, but since I was able to obtain deterministic training on CPU, I suspect the problem lies with PyTorch.

Any help would be greatly appreciated. Many thanks in advance!

Have you looked at the notes on reproducibility?

Best regards

Thomas

Hi Thomas,

I just read it and saw this:

There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. PyTorch functions that use atomicAdd in the forward include torch.Tensor.index_add_(), torch.Tensor.scatter_add_(), torch.bincount().

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.
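
To see what this means in practice, here is a small illustration I put together (my own sketch, not from the docs) using torch.Tensor.index_add_() with repeated indices:

    import torch

    # index_add_ uses atomicAdd on CUDA, so with many collisions per index the
    # summation order varies between calls and results may differ slightly.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.manual_seed(123)
    index = torch.randint(0, 10, (100000,), device=device)
    src = torch.randn(100000, device=device)

    results = []
    for _ in range(5):
        out = torch.zeros(10, device=device)
        out.index_add_(0, index, src)
        results.append(out)

    # On CPU this prints zeros; on GPU small non-zero deviations can appear.
    print(torch.stack(results).std(dim=0))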

Just to make sure: does this mean that as long as my script uses one of these operations, there is no way to guarantee deterministic training? Not even with torch.backends.cudnn.deterministic=True?

Thanks.

Mostly, yes.

Best regards

Thomas

I see.

…and many forms of pooling, padding, and sampling.

Is there a full list of these operations hosted somewhere?

Thanks.


It would be great if there were a full list of the operations that may not be reproducible even with the cudnn.deterministic flag enabled.
I am using torch.nn.CTCLoss. On CPU the results are reproducible, but on GPU, with the settings below, the results of two identical runs slowly diverge. I just want to confirm that there is no way for me to get reproducible training (loss values, parameter values) as long as I am using CTC loss.

    import os
    import random

    import numpy as np
    import torch
    from torch.backends import cudnn

    seed = 0  # any fixed value

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = "0"
    random.seed(seed)

    cudnn.deterministic = True
    cudnn.benchmark = False
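
For reference, this is the kind of minimal check I am using to see where the divergence starts (my own sketch with arbitrary shapes, not my actual model):

    import torch
    import torch.nn as nn

    # Run the same CTC forward/backward twice and compare the gradients.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.manual_seed(0)
    log_probs = torch.randn(50, 4, 20, device=device).log_softmax(2)  # (T, N, C)
    targets = torch.randint(1, 20, (4, 10), dtype=torch.long, device=device)
    input_lengths = torch.full((4,), 50, dtype=torch.long)
    target_lengths = torch.full((4,), 10, dtype=torch.long)
    ctc = nn.CTCLoss()

    grads = []
    for _ in range(2):
        x = log_probs.detach().clone().requires_grad_(True)
        ctc(x, targets, input_lengths, target_lengths).backward()
        grads.append(x.grad.clone())

    # On CPU this difference is exactly zero; on GPU the atomicAdd-based
    # backward can make it slightly non-zero.
    print((grads[0] - grads[1]).abs().max())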

Thanks

I feel like right now the only way to figure this out is to actually look at the C++ code, which is a bummer for people who only want to deal with the Python interface of PyTorch :frowning: