I am training with DDP on 4 V100 GPUs. I use a DistributedSampler for the training set, and for testing I set the sampler to None.
From my understanding, setting the sampler to None during evaluation means the data is not split into chunks, so all four GPUs run over the same data and should therefore get the same results.
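To make that expectation concrete, here is a minimal pure-Python sketch of how indices end up on each rank. It is only a simplified model of the behaviour (no shuffling, no padding), not the actual training code: `indices_for_rank` is a hypothetical helper rather than a PyTorch API, and the dataset size of 8 is chosen purely for illustration.

```python
# Simplified model of per-rank index assignment (no shuffling, no padding).
# `indices_for_rank` is a hypothetical helper for illustration, not a PyTorch API.

def indices_for_rank(num_samples, world_size, rank, use_distributed_sampler):
    all_indices = list(range(num_samples))
    if use_distributed_sampler:
        # Each rank takes every world_size-th index, starting at its own rank.
        return all_indices[rank::world_size]
    # sampler=None: every rank iterates over the full dataset.
    return all_indices

WORLD_SIZE = 4    # four GPUs, as in the setup above
NUM_SAMPLES = 8   # tiny dataset size, assumed only for this example

train_shards = [indices_for_rank(NUM_SAMPLES, WORLD_SIZE, r, True)
                for r in range(WORLD_SIZE)]
test_shards = [indices_for_rank(NUM_SAMPLES, WORLD_SIZE, r, False)
               for r in range(WORLD_SIZE)]

for rank in range(WORLD_SIZE):
    print(f"rank {rank}: train={train_shards[rank]} test={test_shards[rank]}")
```

In this model the training indices are disjoint across ranks, while every rank sees the identical test indices, so a difference in test accuracy would not be explained by the data split alone.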
But when I evaluate after each epoch, the four GPUs do not all report the same accuracy.
Specifically, GPU 0 reports a different accuracy, while the other 3 GPUs all report the same value.
After training finishes, when I load the saved weights and evaluate using exactly the same function, I get the same result across all 4 GPUs.
So my question is: why does GPU 0 give a different accuracy when evaluating during training, even though I am using exactly the same evaluation function?
Thanks in advance.