Hi,
I recently ran into a reproducibility issue: the same PyTorch model produces very different results on different machines, even though the random seed is fixed, and the difference is large in absolute value.
After tracking down the issue, it appears that torch.Tensor.exponential_() is not deterministic across machines, even with a fixed random seed.
The following is an MWE; the PyTorch version is 1.13.0.
import torch
"""
The following two tensors reproduce the inconsistency issue.
"""
logits = torch.zeros(100, 13, 97, dtype=torch.float64, device='cuda:0')
# logits = torch.zeros(1000, 1000, dtype=torch.float64, device='cuda:0')
"""
The shape MATTERS..
The following shapes do not reproduce the issue.
"""
# logits = torch.zeros(100, 10, 97, dtype=torch.float64, device='cuda:0')
# logits = torch.zeros(100, 97, dtype=torch.float64, device='cuda:0')
torch.manual_seed(0)
# g_cuda = torch.Generator(device='cuda:0')
# g_cuda.manual_seed(0)
sample = torch.empty_like(logits).exponential_()
# sample[0] is deterministic across different machines
print(sample[0].sum())
# However, sample[-1] is different on different machines
print(sample[-1].sum())
On an RTX 3090 GPU, the output is:
tensor(1246.9592, device='cuda:0', dtype=torch.float64)
tensor(1304.7903, device='cuda:0', dtype=torch.float64)
On an A5000 GPU, the output is:
tensor(1246.9592, device='cuda:0', dtype=torch.float64)
tensor(1229.9033, device='cuda:0', dtype=torch.float64)
Can someone advise on this? Is this potentially a bug?
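For now, my workaround is to draw the samples on the CPU and copy them to the GPU. This is only a sketch and assumes the CPU generator produces an identical stream across machines for a fixed seed and the same PyTorch version:

import torch

logits = torch.zeros(100, 13, 97, dtype=torch.float64, device='cuda:0')

# Sample on the CPU with a dedicated generator, then copy to the GPU.
# Assumption: the CPU RNG stream is machine-independent for a fixed
# seed; this sidesteps the device-dependent CUDA stream at the cost of
# an extra host-to-device copy.
g_cpu = torch.Generator()
g_cpu.manual_seed(0)
sample = (torch.empty(logits.shape, dtype=logits.dtype)
          .exponential_(generator=g_cpu)
          .to(logits.device))
print(sample[0].sum())
print(sample[-1].sum())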