I’m trying to train a network using cuDNN, but I get different results on every execution. I have no idea why, as I’m trying to ensure determinism in every way I know.
Here is what I’m currently using to do so:
torch.backends.cudnn.deterministic = True
I also use num_workers=0 in the DataLoader, and I have manually checked that the input data to the network is the same in every execution.
The parameters of the network are also initialized the same way, but as soon as the second or third batch comes in, some parameters and outputs of the network start to change slightly, leading to different training results.
Am I missing something?
I’m also struggling with reproducibility, and I’m interested to see what solutions this thread turns up. By the way, did you try running on the CPU to see whether the CPU version is more reproducible?
If you are sampling random numbers on the GPU, you might have to set the GPU seed as well.
Have a look at this example code.
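Something along these lines (a minimal sketch; the `set_seed` helper name and the seed value are arbitrary, and the cuDNN flags only matter when training on the GPU):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    # Seed every RNG that might feed the training loop.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all GPU generators (no-op without CUDA)

    # Ask cuDNN for deterministic algorithms and disable benchmarking,
    # which can otherwise select different kernels on each run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```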
You are right! The GPU should be seeded using torch.cuda.manual_seed_all(seed).
Yes, using the CPU I always get the same results, so it must be something to do with cuDNN.
Idea: provide a short piece of code that is sufficient to reproduce the issue reliably.
It’s a very large network, so it is going to be very difficult for me to reproduce the issue with a short piece of code.
But I’m getting closer to the issue: changing the tensor type to double instead of float solves it and gives me deterministic results.
I’m still trying to work out why that happens.
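One plausible reason double precision hides the issue: float32 rounding error at a given magnitude is roughly 10^8 times larger than float64 error, so order-dependent rounding differences visible in single precision can fall below notice in double. A small pure-Python illustration (the `f32` helper is just a way to simulate single-precision rounding; it is not part of PyTorch):

```python
import struct


def f32(x: float) -> float:
    # Round a Python double to the nearest single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]


# Accumulate 1000 tiny contributions onto 1.0.
acc32 = 1.0
acc64 = 1.0
for _ in range(1000):
    acc32 = f32(acc32 + 1e-8)  # 1e-8 is below float32 resolution at 1.0, so it is lost
    acc64 = acc64 + 1e-8       # float64 keeps every contribution

# acc32 is still exactly 1.0, while acc64 has grown by about 1e-5.
```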
I know this convo is a little old, but I’m under the impression there’s some non-determinism in a few cuDNN operations, such as atomic adds on floating-point values. That might be the issue here.
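For what it’s worth, the underlying non-associativity is easy to see even in plain Python (which uses double precision); with float32 atomic adds on the GPU, the order-dependent differences are much larger:

```python
# Floating-point addition is not associative, so the order in which
# parallel atomic adds complete can change the low bits of a sum.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right)  # the two groupings differ in the last bit
```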