Non-deterministic results with cuDNN

HEIMDAL13 · May 28, 2018, 11:20am

Hi,

I’m trying to train a network using cudnn but at every execution I’m getting different results. I have no idea why, as I’m trying to ensure determinism in every way I know.

Here is what I’m currently using to do so:

import random
import numpy
import torch
torch.manual_seed(0)
numpy.random.seed(0)
random.seed(0)
torch.backends.cudnn.deterministic = True

I also use num_workers=0 in the DataLoader, and I have manually checked that the input data to the network is identical in every execution.

The parameters of the network are also initialized in the same way, but as soon as the second or third batch comes in, some parameters and outputs of the network start to change slightly, leading to different training results.

Am I missing something?

Thanks.

hughperkins · May 28, 2018, 11:58am

I’m also struggling with reproducibility, and I’m interested to see what solution(s) this thread turns up. By the way, did you try running on the CPU to see whether the CPU version is more reproducible?

ptrblck · May 28, 2018, 12:00pm

If you are sampling random numbers on the GPU, you might also have to call torch.cuda.manual_seed.
Have a look at this example code.

hughperkins · May 28, 2018, 12:23pm

I found that torch.manual_seed sets both the CPU and the GPU seed for me (at least on 0.4.0). https://pytorch.org/docs/stable/_modules/torch/random.html#manual_seed

ptrblck · May 28, 2018, 12:28pm

You are right! torch.manual_seed also seeds the GPU.
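
For reference, here is a minimal sketch that puts the seeding calls mentioned so far in one place. The helper name seed_everything is made up for illustration. Setting torch.backends.cudnn.benchmark = False is an extra step that was not in the original post; it is an assumption that cuDNN autotuning could otherwise pick different algorithms between runs. The explicit torch.cuda.manual_seed_all call should be redundant if torch.manual_seed already seeds the GPU, but it is harmless.

```python
import random

import numpy as np
import torch


def seed_everything(seed=0):
    """Seed the Python, NumPy and PyTorch RNGs and ask cuDNN for deterministic kernels.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Explicit GPU seeding; silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # Assumption: disabling autotuning avoids run-to-run changes in algorithm choice.
    torch.backends.cudnn.benchmark = False


seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling it once at the very top of the training script, before the model and the DataLoader are created, keeps the initialization and the shuffling order under the same seed.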

HEIMDAL13 · May 28, 2018, 1:31pm

Yes, using the CPU I always get the same results. So it must be something to do with cudnn.

hughperkins · May 28, 2018, 6:16pm

Idea: provide a short piece of code that is sufficient to reproduce the issue reliably.
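
Something along these lines might be a starting point: a rough sketch that trains a tiny made-up network twice with the same seed and reports the largest difference between the resulting parameters. The architecture, sizes and the names train_once and max_param_diff are all invented for illustration and have nothing to do with the actual network in question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_once(seed=0, steps=5, device="cpu"):
    """Train a tiny conv net for a few steps and return copies of its parameters."""
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Model and data are created on the CPU first, so their initial values
    # depend only on the CPU generator, then moved to the target device.
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(8 * 8 * 8, 10),
    ).to(device)
    x = torch.randn(16, 3, 8, 8).to(device)
    y = torch.randint(0, 10, (16,)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return [p.detach().cpu().clone() for p in model.parameters()]


def max_param_diff(device="cpu"):
    """Largest absolute parameter difference between two identically seeded runs."""
    run_a = train_once(device=device)
    run_b = train_once(device=device)
    return max((a - b).abs().max().item() for a, b in zip(run_a, run_b))


device = "cuda" if torch.cuda.is_available() else "cpu"
print(device, max_param_diff(device))
```

If this prints 0.0 on the GPU, layers from the real network could be swapped in one at a time until the difference becomes non-zero, which would narrow down the operation responsible.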

HEIMDAL13 · May 29, 2018, 11:04am

It’s a very large network, so it is going to be very difficult for me to reproduce the issue with a short piece of code.

But I’m getting closer to the issue, as changing the tensor type to double instead of float using:

torch.set_default_tensor_type('torch.DoubleTensor')

solves the issue and allows me to get deterministic results.

I’m still trying to find out why this happens.

wminshew · September 20, 2018, 8:44pm

I know this convo is a little old, but I’m under the impression that there’s some non-determinism in a few cuDNN operations, e.g. ones that use atomic adds on floating-point values. That might be the issue here.

https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility
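
To illustrate why the order of floating-point additions matters, here is a small pure-Python sketch (no GPU involved). It only shows that floating-point addition is not associative; that this is the cause of the behaviour reported above is an assumption, not something confirmed in this thread. The helper sum_in_order is made up for the example.

```python
import random


def sum_in_order(values, order):
    """Add the values one by one in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# The same three numbers, grouped differently, give different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Summing the same list in shuffled orders usually yields several distinct totals,
# similar to what can happen when parallel threads add in an unpredictable order.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
totals = set()
for _ in range(20):
    order = list(range(len(values)))
    rng.shuffle(order)
    totals.add(sum_in_order(values, order))
print(len(totals))
```

Doubles carry more precision than floats, so these ordering differences are much smaller and may stay below whatever is being compared. That could explain why switching to DoubleTensor appeared to fix things, although it would not make the underlying operations deterministic.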