Reproducibility Issue of torch.Tensor.exponential_()

Hi,

I recently ran into a reproducibility issue. Specifically, the same PyTorch model outputs very different results on different machines, even though the random seed is fixed. The absolute value of the difference is quite large.

After tracking down the issue, it seems to me that torch.Tensor.exponential_() does not produce the same values on different machines, even with a fixed random seed.

The following is an MWE. The PyTorch version is 1.13.0.

import torch

"""
The following two tensors reproduce the inconsistency issue.
"""
logits = torch.zeros(100, 13, 97, dtype=torch.float64, device='cuda:0')
# logits = torch.zeros(1000, 1000, dtype=torch.float64, device='cuda:0')

"""
The shape MATTERS.
The following shapes do NOT reproduce the issue.
"""
# logits = torch.zeros(100, 10, 97, dtype=torch.float64, device='cuda:0')
# logits = torch.zeros(100, 97, dtype=torch.float64, device='cuda:0')

torch.manual_seed(0)
# g_cuda = torch.Generator(device='cuda:0')
# g_cuda.manual_seed(0)

sample = torch.empty_like(logits).exponential_()

# sample[0] is deterministic across different machines
print(sample[0].sum())

# However, sample[-1] is different on different machines
print(sample[-1].sum())

On an RTX 3090 GPU, the output is:

tensor(1246.9592, device='cuda:0', dtype=torch.float64)
tensor(1304.7903, device='cuda:0', dtype=torch.float64)

On an A5000 GPU, the output is:

tensor(1246.9592, device='cuda:0', dtype=torch.float64)
tensor(1229.9033, device='cuda:0', dtype=torch.float64)

Can someone advise on this? Is this potentially a bug?

There is generally no guarantee to get bitwise-identical values on different devices or using different releases.
Are you seeing deterministic results for each setup separately, or do the results also differ between runs on the same setup?

Hi @ptrblck

On the same machine with the same PyTorch version, each run is deterministic. The non-determinism only happens when deploying the code on different machines.

From what I understand, once the random seed is fixed, all samples should be the same (bitwise-identical) up to floating point precision. In the above case, I used double precision to make sure the numerical errors of the floating point computation are minimal. Yet the samples are still quite different (and this difference eventually causes a huge performance difference in the model).

Also, it is very bizarre that the non-determinism only happens if the logits tensor has a certain shape. In fact, in the above example, the majority of the samples are bitwise-identical; only the last few rows have different values.
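
For reference, this is roughly how such a comparison can be done, assuming the sample from the first machine is dumped to a file and copied over (the file name sample_a.pt below is just a placeholder):

import torch

torch.manual_seed(0)
logits = torch.zeros(100, 13, 97, dtype=torch.float64, device='cuda:0')
sample = torch.empty_like(logits).exponential_()

# On the first machine: torch.save(sample.cpu(), 'sample_a.pt')
# On the second machine, after copying the file over:
ref = torch.load('sample_a.pt')

# One boolean per row of the leading dimension: True where that row is not bitwise-identical
row_differs = (sample.cpu() != ref).flatten(1).any(dim=1)
print(row_differs.nonzero().flatten())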

Is this phenomenon expected? And is there any explanation accounting for this?

Thanks!

That’s not the case as different hardware could use different code paths in compute libraries, e.g. cuRAND in this case.

If your training is sensitive to a specific seeded run, I would also assume you would see divergence on the same machine when different seeds are used?
In this case I would recommend trying to stabilize your training to avoid such a dependency on the seed and setup.

I wouldn’t call it non-determinism, since you are getting deterministic results on each platform. The change starting from a specific shape could indicate my aforementioned point of taking different code paths (e.g. vectorized vs. non-vectorized depending on the shape).

Hi @ptrblck

Thanks for your answer!

The model I am running is a pretrained VAE. The samples generated from the VAE are then fed into a Bayesian optimization algorithm to maximize a black-box function. The Bayesian optimization algorithm does not diverge, and AFAIK Bayesian optimization can sometimes be sensitive to randomness.

I certainly think the algorithm should be made less sensitive to random seeds. But in the meantime, is there any way to mitigate the randomness by fixing the code path in cuRAND (something like torch.backends.cudnn.deterministic = True, but for cuRAND)?

I know one way to fix it is to do the sampling on the CPU first and then move the samples to the GPU (a minimal sketch is below). But I am just wondering if there is any way to do it directly on the GPU without transferring data between the CPU and GPU.
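
For completeness, a minimal sketch of that CPU-first workaround, assuming the CPU generator produces the same stream across machines for a given PyTorch version:

import torch

torch.manual_seed(0)
logits = torch.zeros(100, 13, 97, dtype=torch.float64, device='cuda:0')

# Draw the samples with the CPU generator, which does not depend on the GPU model,
# then copy the result to the GPU (at the cost of a host-to-device transfer).
sample = torch.empty(logits.shape, dtype=logits.dtype).exponential_().to(logits.device)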

Thanks!