Deterministic behaviour with CUDA support

What I understand from the docs is that whenever I intend to do computations on the GPU, I need to add the following lines to my main script:

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
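
For context, this is roughly how I place those lines at the top of my script; the torch.cuda.manual_seed_all call is my own addition for the multi-GPU case, not something the quoted docs prescribe here:

import torch

torch.manual_seed(0)                        # seed the PyTorch RNG
torch.cuda.manual_seed_all(0)               # my addition: explicitly seed every CUDA device
torch.backends.cudnn.deterministic = True   # force cuDNN to pick deterministic algorithms
torch.backends.cudnn.benchmark = False      # disable the auto-tuner, which can vary between runs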

Why isn't this included by default if torch recognizes GPU computations?

The article also mentions

When running on the CuDNN backend…

Is there a way to use PyTorch on the GPU without using cuDNN at all?

And what would be the consequences if I don't include these lines?

I guess what I am saying is: for information that seems to be of such importance to performance, I find it odd that it is somewhat hidden in the depths of the documentation…

Hi,

Why isn't this included by default if torch recognizes GPU computations?

Because the deterministic algorithms are significantly slower than non-deterministic ones.

Is there a way to use PyTorch on the GPU without using cuDNN at all?

You can set torch.backends.cudnn.enabled = False. PyTorch has all the necessary CUDA kernels to run without cuDNN. The big advantage of cuDNN is that it ships a collection of algorithms for tasks like convolutions and will choose the one best suited to the given input/weight sizes.
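
As a minimal sketch (assuming a CUDA-capable machine), disabling cuDNN looks like this; the convolution below still runs on the GPU, just through PyTorch's native CUDA kernels:

import torch
import torch.nn as nn

torch.backends.cudnn.enabled = False   # fall back to PyTorch's own CUDA kernels

conv = nn.Conv2d(3, 16, kernel_size=3).cuda()
x = torch.randn(8, 3, 32, 32, device="cuda")
out = conv(x)                          # runs on the GPU without cuDNN's algorithm selection
print(out.shape)                       # torch.Size([8, 16, 30, 30])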

And what would be the consequences if I don't include these lines?

If you don’t include these lines, you will get the fastest algorithms but potentially non-deterministic behaviour.

Keep in mind that non-deterministic here means that all the possible results are correct: floating-point operations can have multiple valid outputs, e.g. (a + b) + c != a + (b + c) because the rounding depends on the order of the operations, and an argmax can return any index of the elements achieving the max value. Which of these correct results is returned can vary from one run to the other.
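
To make that concrete, here is a small illustration (my own toy example, not from the docs) of how the same values summed in a different order give two different, equally valid float results, and of an argmax with tied maxima:

import torch

a = torch.tensor([1e8, 1.0, -1e8])

# Same three numbers, different summation order: the rounding differs,
# yet both results are "correct" float sums.
sum_one = (a[0] + a[1]) + a[2]   # the 1.0 is absorbed by 1e8 -> 0.0
sum_two = (a[0] + a[2]) + a[1]   # the large terms cancel first -> 1.0
print(sum_one.item(), sum_two.item())

# Tied maxima: any of the tied indices (here 0 or 2) is a valid answer,
# and which one you get should not be relied upon across backends/runs.
b = torch.tensor([3.0, 1.0, 3.0])
print(b.argmax())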


Thank you for the explanations.

So I could expect the accuracy and loss to look mostly similar if I ran the very same script twice on the same data, with variations due only to the random weight initialization and the shuffling of the dataloader?

You can think of the non-determinism of the operations in the network as having the same effect as using a different weight initialization or a different shuffling of the dataloader!
It will give you completely different weights in the end, but if your training is “stable”, the performance of the final network will be mostly similar.