Could not get the same results by using torch.backends.cudnn.deterministic = True

I want to reproduce my experiments by using torch.backends.cudnn.deterministic = True. In my code, I use the following settings:

random.seed(arg.manual_seed)
np.random.seed(arg.manual_seed)
torch.manual_seed(arg.manual_seed)
torch.cuda.manual_seed(arg.manual_seed)
torch.cuda.manual_seed_all(arg.manual_seed)  # if you are using multi-GPU.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

The PROBLEM is that when training starts, I get the same outputs and loss for the first iterations. However, the differences gradually increase over time. How can I fix this?

run test 1:

2020-03-19 23:26:41   INFO: >> eta: 7:41:20  iter: 10  lr: 0.0100  loss: 0.8819  acc: 0.6697  Mean IOU: 0.3424  F1: 0.5079  time: 0.6153  data_id: tensor([1924, 1819, 1519, 1157])
2020-03-19 23:26:47   INFO: >> eta: 7:30:37  iter: 20  lr: 0.0100  loss: 0.9337  acc: 0.5401  Mean IOU: 0.3341  F1: 0.5149  time: 0.6011  data_id: tensor([ 824, 1299, 1290, 1147])
2020-03-19 23:26:52   INFO: >> eta: 7:12:11  iter: 30  lr: 0.0100  loss: 0.6674  acc: 0.7553  Mean IOU: 0.4709  F1: 0.5715  time: 0.5766  data_id: tensor([2125,  631, 1042,  656])
2020-03-19 23:26:58   INFO: >> eta: 7:03:26  iter: 40  lr: 0.0100  loss: 0.7282  acc: 0.6599  Mean IOU: 0.3597  F1: 0.4987  time: 0.5651  data_id: tensor([ 714, 2496, 2633, 1904])
2020-03-19 23:27:04   INFO: >> eta: 7:06:16  iter: 50  lr: 0.0100  loss: 0.8958  acc: 0.5448  Mean IOU: 0.3034  F1: 0.4547  time: 0.5690  data_id: tensor([ 406,  759, 2659, 2059])
2020-03-19 23:27:10   INFO: >> eta: 7:08:09  iter: 60  lr: 0.0100  loss: 0.7469  acc: 0.6814  Mean IOU: 0.3380  F1: 0.4817  time: 0.5716  data_id: tensor([2698, 2966, 2848,  795])
2020-03-19 23:27:16   INFO: >> eta: 7:20:39  iter: 70  lr: 0.0100  loss: 0.7930  acc: 0.6840  Mean IOU: 0.4107  F1: 0.5892  time: 0.5885  data_id: tensor([2069, 2454, 2004, 1124])
2020-03-19 23:27:21   INFO: >> eta: 7:25:11  iter: 80  lr: 0.0100  loss: 0.6677  acc: 0.8002  Mean IOU: 0.5350  F1: 0.6715  time: 0.5946  data_id: tensor([1209,  601, 2713,  762])
2020-03-19 23:27:28   INFO: >> eta: 7:24:30  iter: 90  lr: 0.0100  loss: 0.4652  acc: 0.8604  Mean IOU: 0.5154  F1: 0.5864  time: 0.5939  data_id: tensor([1840, 2739,  774,  477])
2020-03-19 23:27:33   INFO: >> eta: 7:28:41  iter: 100  lr: 0.0100  loss: 0.6062  acc: 0.7979  Mean IOU: 0.5363  F1: 0.6408  time: 0.5996  data_id: tensor([2242, 1800, 2816, 2042])

run test 2:

2020-03-19 23:27:53   INFO: >> eta: 7:12:06  iter: 10  lr: 0.0100  loss: 0.8819  acc: 0.6698  Mean IOU: 0.3424  F1: 0.5080  time: 0.5763  data_id: tensor([1924, 1819, 1519, 1157])
2020-03-19 23:27:59   INFO: >> eta: 7:28:21  iter: 20  lr: 0.0100  loss: 0.9334  acc: 0.5406  Mean IOU: 0.3348  F1: 0.5154  time: 0.5981  data_id: tensor([ 824, 1299, 1290, 1147])
2020-03-19 23:28:05   INFO: >> eta: 7:39:09  iter: 30  lr: 0.0100  loss: 0.6695  acc: 0.7579  Mean IOU: 0.4721  F1: 0.5735  time: 0.6126  data_id: tensor([2125,  631, 1042,  656])
2020-03-19 23:28:11   INFO: >> eta: 7:42:29  iter: 40  lr: 0.0100  loss: 0.7506  acc: 0.6406  Mean IOU: 0.3434  F1: 0.4767  time: 0.6172  data_id: tensor([ 714, 2496, 2633, 1904])
2020-03-19 23:28:18   INFO: >> eta: 7:45:48  iter: 50  lr: 0.0100  loss: 0.8407  acc: 0.5869  Mean IOU: 0.3338  F1: 0.4914  time: 0.6218  data_id: tensor([ 406,  759, 2659, 2059])
2020-03-19 23:28:24   INFO: >> eta: 7:47:09  iter: 60  lr: 0.0100  loss: 0.7846  acc: 0.6482  Mean IOU: 0.3253  F1: 0.4666  time: 0.6237  data_id: tensor([2698, 2966, 2848,  795])
2020-03-19 23:28:30   INFO: >> eta: 7:48:15  iter: 70  lr: 0.0100  loss: 0.7823  acc: 0.6842  Mean IOU: 0.4137  F1: 0.5920  time: 0.6253  data_id: tensor([2069, 2454, 2004, 1124])
2020-03-19 23:28:36   INFO: >> eta: 7:43:39  iter: 80  lr: 0.0100  loss: 0.7253  acc: 0.7692  Mean IOU: 0.5027  F1: 0.6540  time: 0.6193  data_id: tensor([1209,  601, 2713,  762])
2020-03-19 23:28:42   INFO: >> eta: 7:41:48  iter: 90  lr: 0.0100  loss: 0.4453  acc: 0.8587  Mean IOU: 0.5065  F1: 0.5784  time: 0.6170  data_id: tensor([1840, 2739,  774,  477])
2020-03-19 23:28:49   INFO: >> eta: 7:50:35  iter: 100  lr: 0.0100  loss: 0.6081  acc: 0.7809  Mean IOU: 0.5189  F1: 0.6269  time: 0.6289  data_id: tensor([2242, 1800, 2816, 2042])

Could you have a look at the Reproducibility docs and check if you are using any of the non-deterministic methods?

There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. PyTorch functions that use atomicAdd in the forward include torch.Tensor.index_add_(), torch.Tensor.scatter_add_(), and torch.bincount().
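As an illustration (just a sketch, assuming a CUDA device is available), two identical index_add_ calls with duplicate indices can already differ in the last bits, because the order of the atomic floating-point additions is undefined:

import torch

src = torch.randn(100000, device='cuda')
idx = torch.randint(0, 10, (100000,), device='cuda')  # many duplicate indices

out1 = torch.zeros(10, device='cuda').index_add_(0, idx, src)
out2 = torch.zeros(10, device='cuda').index_add_(0, idx, src)

print(torch.equal(out1, out2))            # may be False on the GPU
print((out1 - out2).abs().max().item())   # tiny, but possibly non-zero difference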

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There is currently no simple way of avoiding non-determinism in these functions.
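If it is unclear which op is the culprit, one rough way to narrow it down (a sketch, assuming you have your model, a fixed batch of inputs/targets, and your loss function at hand) is to run the identical forward/backward twice from the same weights and RNG state and compare the gradients parameter by parameter:

import copy
import torch

def grads_after_one_step(model, inputs, targets, criterion, seed=0):
    torch.manual_seed(seed)               # same dropout masks etc. in both runs
    torch.cuda.manual_seed_all(seed)
    model = copy.deepcopy(model)          # start both runs from identical weights
    model.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    return {name: p.grad.clone() for name, p in model.named_parameters()}

def compare_two_runs(model, inputs, targets, criterion):
    g1 = grads_after_one_step(model, inputs, targets, criterion)
    g2 = grads_after_one_step(model, inputs, targets, criterion)
    for name in g1:
        diff = (g1[name] - g2[name]).abs().max().item()
        if diff > 0:
            print(name, diff)             # parameters fed by a non-deterministic backward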

I have checked the Reproducibility docs before. However, I do not know which of the operations are the non-deterministic ones.
In my code, I use conv2d, BN, ReLU, Dropout2d, cat, interpolate, and cross_entropy.
Could you tell me which of these is non-deterministic?

What kind of interpolation are you using?

I used bilinear interpolation with align_corners=True. Does it cause any problems?
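As far as I know, the backward of bilinear upsampling on CUDA is implemented with atomicAdd, so it is one of the non-deterministic operations. A quick way to check this in isolation (a sketch, assuming a CUDA device) is to compare the input gradient of a bilinear F.interpolate call across two identical backward passes:

import torch
import torch.nn.functional as F

x = torch.randn(4, 16, 64, 64, device='cuda')

def input_grad():
    xi = x.clone().requires_grad_(True)
    y = F.interpolate(xi, scale_factor=2, mode='bilinear', align_corners=True)
    y.sum().backward()
    return xi.grad.clone()

g1, g2 = input_grad(), input_grad()
print(torch.equal(g1, g2))            # may be False: the bilinear backward uses atomicAdd
print((g1 - g2).abs().max().item())   # usually only a tiny floating-point difference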

Hi, how did you solve it? I am now having the same problem. My PyTorch version is 1.7.1, and due to some constraints I cannot update it.