I’m trying to train a network using cuDNN, but I get different results on every run. I have no idea why, as I’m trying to ensure determinism in every way I know.
I also use num_workers=0 in the dataloader, and I have manually checked that the input data to the network is always the same in every execution.
The parameters of the network are also initialized in the same way, but as soon as the second or third batch comes in, some parameters and outputs of the network start to change slightly, leading to different training results.
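For reference, here is a minimal sketch of the settings usually tried in this situation, assuming the framework is PyTorch (the post mentions cuDNN and a dataloader with num_workers but never names the framework). The helper name `seed_everything` and the seed value are made up for illustration, and these flags may still not cover every op.

```python
import os
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed the RNGs and request deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and the CUDA generators

    # Ask cuDNN for deterministic algorithms and disable the autotuner,
    # which may otherwise pick different kernels from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Some cuBLAS operations need this workspace setting to be deterministic;
    # it should be set before the first CUDA call.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    # Raise an error when an op without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)
```

Calling `seed_everything(0)` once at the very top of the training script, before the model is built, is the intended usage. If an op then raises an error, that op is a likely source of the run-to-run drift.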
I’m also struggling with reproducibility, and I’m interested to see what solution(s) come out of this thread. By the way, did you try running on the CPU to see whether the CPU version is more reproducible?
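One way to run that CPU check is to train the same tiny model twice with the same seed and compare the parameters bit for bit. This is only a sketch, assuming PyTorch; the model, the synthetic data, and the names `train_once` and `is_reproducible` are invented for illustration and are not the poster's setup.

```python
import torch
from torch import nn


def train_once(device: str, seed: int = 0, steps: int = 5) -> torch.Tensor:
    """Train a tiny stand-in model and return its parameters, flattened, on the CPU."""
    torch.manual_seed(seed)
    # Weights are initialized on the CPU and then moved, so the starting
    # point is identical regardless of the device.
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Fixed synthetic data generated on the CPU so every run sees the same inputs.
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(32, 8, generator=gen).to(device)
    y = torch.randn(32, 1, generator=gen).to(device)

    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

    return torch.cat([p.detach().flatten().cpu() for p in model.parameters()])


def is_reproducible(device: str) -> bool:
    """True if two identically seeded runs end with bit-identical parameters."""
    return torch.equal(train_once(device), train_once(device))
```

Running `is_reproducible("cpu")` and then, if a GPU is available, `is_reproducible("cuda")` shows whether the drift is specific to the GPU path. Swapping in the real model and a few real batches makes the check more meaningful.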
I know this thread is a little old, but I’m under the impression that there’s some non-determinism in a few cuDNN operations, such as atomic adds on floating-point values. That might be the issue here.
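To illustrate why the order of floating-point additions matters, here is a small pure-Python sketch. It does not use cuDNN at all; it only shows that floating-point addition is not associative, so if parallel threads accumulate into the same value in a different order on each run (as atomic adds may do), the result can differ. The function name `sum_in_order` and the values are chosen just for the demonstration.

```python
def sum_in_order(values, order):
    """Add the values one by one, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16]

# Same three numbers, two different accumulation orders.
first = sum_in_order(values, [0, 1, 2])   # (1e16 + 1.0) + -1e16
second = sum_in_order(values, [0, 2, 1])  # (1e16 + -1e16) + 1.0

print(first, second)  # 0.0 1.0
```

In the first order the 1.0 is lost to rounding when it is added to 1e16; in the second it survives. Real gradients differ by far less than this, but small differences of this kind can compound over training steps.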