Hi all,
I was using NVIDIA/pix2pixHD and trying to make the training deterministic. Here's what I set at the beginning of the main training entry script:
import os
import random
import numpy as np
import torch

# seed every RNG the training touches
random.seed(123)
np.random.seed(123)
torch.manual_seed(123)
torch.cuda.manual_seed_all(123)
# ask cuDNN for deterministic kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ['PYTHONHASHSEED'] = str(123)
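(For completeness: I know newer PyTorch releases also expose a global switch for deterministic algorithms, but as far as I can tell it is not available in a version I can pair with CUDA 10.0, so I could not try it here.)

# available in newer PyTorch releases (1.8+); raises an error when a
# non-deterministic op is hit instead of silently running it
# torch.use_deterministic_algorithms(True)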
But the training was not deterministic, and the discrepancy was too large to be caused by floating-point error alone.
Here’s the output from one run:
(epoch: 1, iters: 1, time: 4.743) G_GAN: 4.459 G_GAN_Feat: 12.911 G_VGG: 10.349 D_real: 4.135 D_fake: 2.747
(epoch: 1, iters: 2, time: 0.698) G_GAN: 21.762 G_GAN_Feat: 12.063 G_VGG: 8.263 D_real: 26.311 D_fake: 21.320
(epoch: 1, iters: 3, time: 0.696) G_GAN: 7.435 G_GAN_Feat: 12.372 G_VGG: 8.679 D_real: 8.118 D_fake: 7.287
(epoch: 1, iters: 4, time: 0.693) G_GAN: 7.552 G_GAN_Feat: 11.537 G_VGG: 7.424 D_real: 6.559 D_fake: 7.466
(epoch: 1, iters: 5, time: 0.697) G_GAN: 7.928 G_GAN_Feat: 10.878 G_VGG: 8.100 D_real: 6.153 D_fake: 7.848
(epoch: 1, iters: 6, time: 0.697) G_GAN: 4.726 G_GAN_Feat: 10.149 G_VGG: 8.180 D_real: 3.945 D_fake: 4.801
Here’s another run:
(epoch: 1, iters: 1, time: 4.797) G_GAN: 4.459 G_GAN_Feat: 12.911 G_VGG: 10.349 D_real: 4.135 D_fake: 2.747
(epoch: 1, iters: 2, time: 0.703) G_GAN: 21.762 G_GAN_Feat: 12.063 G_VGG: 8.263 D_real: 26.311 D_fake: 21.320
(epoch: 1, iters: 3, time: 0.702) G_GAN: 7.434 G_GAN_Feat: 12.372 G_VGG: 8.679 D_real: 8.118 D_fake: 7.287
(epoch: 1, iters: 4, time: 0.701) G_GAN: 7.590 G_GAN_Feat: 11.538 G_VGG: 7.420 D_real: 6.560 D_fake: 7.503
(epoch: 1, iters: 5, time: 0.702) G_GAN: 7.839 G_GAN_Feat: 10.882 G_VGG: 8.096 D_real: 6.145 D_fake: 7.759
(epoch: 1, iters: 6, time: 0.702) G_GAN: 4.819 G_GAN_Feat: 10.133 G_VGG: 8.170 D_real: 3.960 D_fake: 4.899
Note that the two runs above were obtained with num_workers=0 when creating the data loaders; with num_workers>0 the non-determinism is present as well.
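For reference, the kind of per-worker seeding I would expect to be needed with num_workers>0 looks roughly like the sketch below; seed_worker is my own illustration, not something from the repo, and the dataset/batch size are placeholders.

def seed_worker(worker_id):
    # inside a worker, torch.initial_seed() is base_seed + worker_id,
    # so each worker gets its own deterministic seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
)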
The original implementation has a bug that breaks CPU training, so I tried my own fork (which has some modifications compared to the original), and there training on CPU is fully deterministic. On GPU, however, my fork cannot train deterministically either. Also, if it matters, I am using a single 2080 Ti and CUDA 10.0 for these runs.
Honestly, I am not sure whether this is a pix2pixHD issue or a PyTorch one, but since I was able to obtain deterministic training on CPU, I suspect the problem might be on the PyTorch side.
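In case it helps anyone reproduce this, a minimal way to see exactly where two runs diverge (beyond comparing the printed losses) would be a helper like the one below; it is only an illustration, and model stands for whichever network is being checked.

import hashlib

def params_digest(model):
    # hash every parameter tensor so two runs can be compared bit for bit
    h = hashlib.sha256()
    for p in model.parameters():
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()

# printing this once per iteration and diffing the logs of two runs
# shows the first iteration at which the parameters actually differ
# print(epoch, i, params_digest(model))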
Any help would be greatly appreciated. Many thanks in advance!