I want to reproduce my experiments using `torch.backends.cudnn.deterministic = True`. In my code, I seed everything like this:
random.seed(arg.manual_seed)
np.random.seed(arg.manual_seed)
torch.manual_seed(arg.manual_seed)
torch.cuda.manual_seed(arg.manual_seed)
torch.cuda.manual_seed_all(arg.manual_seed) # if you are using multi-GPU.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False     # disable the autotuner, which can pick different kernels per run
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
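As I understand it, even with all of these seeds set, some GPU kernels may still accumulate floating-point values in a non-deterministic order, and floating-point addition is not associative, so tiny per-step differences can appear and compound over training. A minimal plain-Python sketch (no GPU involved) of how accumulation order alone changes a sum:

```python
# Floating-point addition is not associative: summing the same values
# in a different order can give a slightly different result.
vals = [0.1, 0.2, 0.3]

forward = (vals[0] + vals[1]) + vals[2]   # 0.1 + 0.2 first
reverse = (vals[2] + vals[1]) + vals[0]   # 0.3 + 0.2 first

print(forward)             # 0.6000000000000001
print(reverse)             # 0.6
print(forward == reverse)  # False
```

If something like this happens inside a reduction kernel whose thread scheduling varies between runs, seeding alone cannot remove the drift.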
The PROBLEM is that when training starts, I get the same outputs and losses at first, but the differences between runs gradually increase. How can I fix this?
run test 1:
2020-03-19 23:26:41 INFO: >> eta: 7:41:20 iter: 10 lr: 0.0100 loss: 0.8819 acc: 0.6697 Mean IOU: 0.3424 F1: 0.5079 time: 0.6153 data_id: tensor([1924, 1819, 1519, 1157])
2020-03-19 23:26:47 INFO: >> eta: 7:30:37 iter: 20 lr: 0.0100 loss: 0.9337 acc: 0.5401 Mean IOU: 0.3341 F1: 0.5149 time: 0.6011 data_id: tensor([ 824, 1299, 1290, 1147])
2020-03-19 23:26:52 INFO: >> eta: 7:12:11 iter: 30 lr: 0.0100 loss: 0.6674 acc: 0.7553 Mean IOU: 0.4709 F1: 0.5715 time: 0.5766 data_id: tensor([2125, 631, 1042, 656])
2020-03-19 23:26:58 INFO: >> eta: 7:03:26 iter: 40 lr: 0.0100 loss: 0.7282 acc: 0.6599 Mean IOU: 0.3597 F1: 0.4987 time: 0.5651 data_id: tensor([ 714, 2496, 2633, 1904])
2020-03-19 23:27:04 INFO: >> eta: 7:06:16 iter: 50 lr: 0.0100 loss: 0.8958 acc: 0.5448 Mean IOU: 0.3034 F1: 0.4547 time: 0.5690 data_id: tensor([ 406, 759, 2659, 2059])
2020-03-19 23:27:10 INFO: >> eta: 7:08:09 iter: 60 lr: 0.0100 loss: 0.7469 acc: 0.6814 Mean IOU: 0.3380 F1: 0.4817 time: 0.5716 data_id: tensor([2698, 2966, 2848, 795])
2020-03-19 23:27:16 INFO: >> eta: 7:20:39 iter: 70 lr: 0.0100 loss: 0.7930 acc: 0.6840 Mean IOU: 0.4107 F1: 0.5892 time: 0.5885 data_id: tensor([2069, 2454, 2004, 1124])
2020-03-19 23:27:21 INFO: >> eta: 7:25:11 iter: 80 lr: 0.0100 loss: 0.6677 acc: 0.8002 Mean IOU: 0.5350 F1: 0.6715 time: 0.5946 data_id: tensor([1209, 601, 2713, 762])
2020-03-19 23:27:28 INFO: >> eta: 7:24:30 iter: 90 lr: 0.0100 loss: 0.4652 acc: 0.8604 Mean IOU: 0.5154 F1: 0.5864 time: 0.5939 data_id: tensor([1840, 2739, 774, 477])
2020-03-19 23:27:33 INFO: >> eta: 7:28:41 iter: 100 lr: 0.0100 loss: 0.6062 acc: 0.7979 Mean IOU: 0.5363 F1: 0.6408 time: 0.5996 data_id: tensor([2242, 1800, 2816, 2042])
run test 2:
2020-03-19 23:27:53 INFO: >> eta: 7:12:06 iter: 10 lr: 0.0100 loss: 0.8819 acc: 0.6698 Mean IOU: 0.3424 F1: 0.5080 time: 0.5763 data_id: tensor([1924, 1819, 1519, 1157])
2020-03-19 23:27:59 INFO: >> eta: 7:28:21 iter: 20 lr: 0.0100 loss: 0.9334 acc: 0.5406 Mean IOU: 0.3348 F1: 0.5154 time: 0.5981 data_id: tensor([ 824, 1299, 1290, 1147])
2020-03-19 23:28:05 INFO: >> eta: 7:39:09 iter: 30 lr: 0.0100 loss: 0.6695 acc: 0.7579 Mean IOU: 0.4721 F1: 0.5735 time: 0.6126 data_id: tensor([2125, 631, 1042, 656])
2020-03-19 23:28:11 INFO: >> eta: 7:42:29 iter: 40 lr: 0.0100 loss: 0.7506 acc: 0.6406 Mean IOU: 0.3434 F1: 0.4767 time: 0.6172 data_id: tensor([ 714, 2496, 2633, 1904])
2020-03-19 23:28:18 INFO: >> eta: 7:45:48 iter: 50 lr: 0.0100 loss: 0.8407 acc: 0.5869 Mean IOU: 0.3338 F1: 0.4914 time: 0.6218 data_id: tensor([ 406, 759, 2659, 2059])
2020-03-19 23:28:24 INFO: >> eta: 7:47:09 iter: 60 lr: 0.0100 loss: 0.7846 acc: 0.6482 Mean IOU: 0.3253 F1: 0.4666 time: 0.6237 data_id: tensor([2698, 2966, 2848, 795])
2020-03-19 23:28:30 INFO: >> eta: 7:48:15 iter: 70 lr: 0.0100 loss: 0.7823 acc: 0.6842 Mean IOU: 0.4137 F1: 0.5920 time: 0.6253 data_id: tensor([2069, 2454, 2004, 1124])
2020-03-19 23:28:36 INFO: >> eta: 7:43:39 iter: 80 lr: 0.0100 loss: 0.7253 acc: 0.7692 Mean IOU: 0.5027 F1: 0.6540 time: 0.6193 data_id: tensor([1209, 601, 2713, 762])
2020-03-19 23:28:42 INFO: >> eta: 7:41:48 iter: 90 lr: 0.0100 loss: 0.4453 acc: 0.8587 Mean IOU: 0.5065 F1: 0.5784 time: 0.6170 data_id: tensor([1840, 2739, 774, 477])
2020-03-19 23:28:49 INFO: >> eta: 7:50:35 iter: 100 lr: 0.0100 loss: 0.6081 acc: 0.7809 Mean IOU: 0.5189 F1: 0.6269 time: 0.6289 data_id: tensor([2242, 1800, 2816, 2042])
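Note that the data_id columns are identical between the two runs, so the data order itself is reproducible; only the computed losses and metrics drift. A toy sketch (plain Python, with a hypothetical dataset size and batch size) of why seeded sampling gives the same ids on every run:

```python
import random

def sample_ids(seed, dataset_size=3000, batch_size=4):
    """Draw one batch of example ids using a fixed seed."""
    rng = random.Random(seed)  # private RNG, analogous to a seeded generator
    return [rng.randrange(dataset_size) for _ in range(batch_size)]

run1 = sample_ids(seed=0)
run2 = sample_ids(seed=0)
print(run1 == run2)  # True: same seed -> same batch ids on every run
```

So the seeding itself is working; the divergence must come from the computation, not the data pipeline.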