Deterministic behaviour with CUDA support

What I understand from the docs is that whenever I intend to do computations on the GPU, I need to add the following lines to my main script:

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
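
For context, this is roughly how I place those lines at the top of my script; the torch.cuda.manual_seed_all call is my own addition for the multi-GPU case, not something the quoted docs prescribe here:

import torch

torch.manual_seed(0)                        # seed the PyTorch RNG
torch.cuda.manual_seed_all(0)               # my addition: explicitly seed every CUDA device
torch.backends.cudnn.deterministic = True   # force cuDNN to pick deterministic algorithms
torch.backends.cudnn.benchmark = False      # disable the auto-tuner, which can vary between runs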

Why isn't this included by default if torch recognizes GPU computations?

The article also mentions

When running on the CuDNN backend…

Is there a way to use PyTorch on the GPU without using cuDNN at all?

And what would be the consequences if I don't include these lines?

I guess what I am saying is: for information that seems to be of such importance to performance, I find it odd that it is somewhat hidden in the depths of the documentation…

Hi,

Why isn't this included by default if torch recognizes GPU computations?

Because the deterministic algorithms are significantly slower than non-deterministic ones.

Is there a way to use PyTorch on the GPU without using cuDNN at all?

You can set torch.backends.cudnn.enabled = False. PyTorch has all the necessary CUDA kernels to run without cuDNN. The big advantage of cuDNN is that it ships a collection of algorithms for tasks like convolutions and will choose the one best suited to the given input/weight sizes.
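
As a minimal sketch (assuming a CUDA-capable machine), disabling cuDNN looks like this; the convolution below still runs on the GPU, just through PyTorch's native CUDA kernels:

import torch
import torch.nn as nn

torch.backends.cudnn.enabled = False   # fall back to PyTorch's own CUDA kernels

conv = nn.Conv2d(3, 16, kernel_size=3).cuda()
x = torch.randn(8, 3, 32, 32, device="cuda")
out = conv(x)                          # runs on the GPU without cuDNN's algorithm selection
print(out.shape)                       # torch.Size([8, 16, 30, 30])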

And what would be the consequences if I don't include these lines?

If you don’t include these lines, you will get the fastest algorithms but potentially non-deterministic behaviour.

Keep in mind that non-deterministic here means that all the possible results are correct: floating-point operations can have multiple valid outputs, e.g. (a + b) + c != a + (b + c) because the rounding depends on the order of the operations, and an argmax can return any index of the elements achieving the max value. Which of these correct results is returned can vary from one run to the other.
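
To make that concrete, here is a small illustration (my own toy example, not from the docs) of how the same values summed in a different order give two different, equally valid float results, and of an argmax with tied maxima:

import torch

a = torch.tensor([1e8, 1.0, -1e8])

# Same three numbers, different summation order: the rounding differs,
# yet both results are "correct" float sums.
sum_one = (a[0] + a[1]) + a[2]   # the 1.0 is absorbed by 1e8 -> 0.0
sum_two = (a[0] + a[2]) + a[1]   # the large terms cancel first -> 1.0
print(sum_one.item(), sum_two.item())

# Tied maxima: any of the tied indices (here 0 or 2) is a valid answer,
# and which one you get should not be relied upon across backends/runs.
b = torch.tensor([3.0, 1.0, 3.0])
print(b.argmax())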


Thank you for the explanations.

So I could expect the accuracy and loss to look mostly similar if I ran the very same script twice on the same data, with variations due only to the random weight initialization and the shuffling of the dataloader?

You can think of the non-determinism of the operations in the network as having the same effect as using a different weight initialization or a different shuffling of the dataloader!
It will give you completely different weights in the end, but if your training is “stable”, the performance of the final network will be mostly similar.