AvgPool2d and non-deterministic results

papoo13 · March 2, 2023, 5:25pm

Is AvgPool2d non-deterministic?
Using only this layer causes reproducibility issues in my model.
I have set all the seeds and determinism=True as well. When I run the code, I don’t get any errors about a non-deterministic function, but the results are affected. Does anyone have any idea about this? If it is non-deterministic, why? And is there a way to implement it efficiently?

ptrblck · March 2, 2023, 8:48pm

nn.AvgPool2d should be deterministic as no error is raised in this code:

import torch
import torch.nn as nn

torch.use_deterministic_algorithms(True)

pool = nn.AvgPool2d(2).cuda()
x = torch.randn(1, 3, 224, 224, device="cuda", requires_grad=True)

for _ in range(10):
    out = pool(x)
    out.mean().backward()
    print(out.double().abs().sum())
    print(x.grad.double().abs().sum())
    x.grad = None

and the values also do not diverge.
Some non-deterministic layers are mentioned in the docs of torch.use_deterministic_algorithms.
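
For reference, here is a minimal sketch of the kind of setup described in the question (seeds plus deterministic mode). The helper name `seed_everything` is hypothetical and not part of PyTorch, and the exact set of flags a given model needs is an assumption; the `CUBLAS_WORKSPACE_CONFIG` value is one of the settings the docs mention for deterministic cuBLAS behavior on CUDA.

```python
import os
import random

import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the RNGs and enable deterministic mode."""
    # Some cuBLAS operations require this variable in deterministic mode.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)        # Python's built-in RNG
    torch.manual_seed(seed)  # PyTorch RNGs (CPU and CUDA devices)
    # Raise an error if an op without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)
    # Avoid cuDNN autotuning, which may select different kernels per run.
    torch.backends.cudnn.benchmark = False


seed_everything(0)
a = torch.randn(4)
seed_everything(0)
b = torch.randn(4)
print(torch.equal(a, b))  # same seed, same machine: identical samples
```

If NumPy or other libraries with their own RNGs are used, they would need to be seeded separately.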

papoo13 · March 3, 2023, 3:07pm

Thank you for the reply. I do not see the reproducibility issue on the same GPU model, but when I run my code on two different machines (same environment, seeds set, determinism enabled, and no warnings about non-deterministic functions), I get different results. I have a Bayesian model used in a continual setup, where the learned posterior becomes the next prior (part of the ELBO loss), so even tiny changes in the first posterior cause the differences to grow larger and larger. Any idea why running the code on two different machines doesn’t lead to reproducible results even when everything has been set up?

ptrblck · March 3, 2023, 7:04pm

Bitwise-identical results between different setups are not guaranteed, since e.g. different code paths could be used based on the GPU capability.
Deterministic settings only ensure deterministic outputs on the same setup.
You should also note that both setups will run into the same limit of floating point precision, and neither of them is “more correct” than the other, assuming both are using the same dtype.
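
As a small illustration of why different code paths can give different bits, the sketch below uses plain Python floats rather than GPU kernels (an assumption made to keep it self-contained): floating point addition is not associative, so a kernel that accumulates values in a different order can produce a slightly different but equally valid result.

```python
import math

# The same three numbers, summed in two different orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                            # False: not bitwise identical
print(math.isclose(left, right, rel_tol=1e-9))  # True: equal within tolerance
```

When comparing runs from different machines, a tolerance-based check such as `torch.allclose` is therefore usually more meaningful than an exact equality check.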