Weird behavior while evaluating using DDP

Hello Everyone,

I am training with DDP on 4 V100 GPUs. I use a DistributedSampler for the training set and set the sampler to None for the test set.
From my understanding, setting the sampler to None during evaluation means the data is not split into chunks, so all four GPUs run over the same data and should produce the same results.
However, when I evaluate after each epoch, the four GPUs report different accuracies.
Specifically, GPU 0 reports a different accuracy, while the other three GPUs all report the same value.
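
As an illustration of that sampler setup (a toy sketch, not the actual training code; the 8-sample dataset, the batch size, and the helper `indices_seen` are made up for the example, and `num_replicas`/`rank` are passed explicitly so it runs without a process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

WORLD_SIZE = 4  # assumption: one process per GPU, as described in the post

# Hypothetical toy dataset: 8 samples whose single feature is the sample index.
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))


def indices_seen(rank, use_sampler):
    """Return the sample values that the given rank iterates over."""
    sampler = None
    if use_sampler:
        # Passing num_replicas and rank explicitly avoids needing init_process_group here.
        sampler = DistributedSampler(
            dataset, num_replicas=WORLD_SIZE, rank=rank, shuffle=False
        )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler, shuffle=False)
    return [int(x) for (batch,) in loader for x in batch.flatten()]


# Training: each rank gets its own shard of the data.
train_view = {r: indices_seen(r, use_sampler=True) for r in range(WORLD_SIZE)}
# Evaluation with sampler=None: every rank walks the full dataset.
eval_view = {r: indices_seen(r, use_sampler=False) for r in range(WORLD_SIZE)}
```

With sampler set to None every rank sees identical data, so any difference in accuracy has to come from the model state on each rank rather than from the inputs.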

After training finishes, when I load the saved weights and evaluate with exactly the same function, I get the same result on all 4 GPUs.

So my question is: why does GPU 0 give a different accuracy when evaluating during training, even though I am using exactly the same function?

Thanks in advance.

Hey,

It looks like GPU 0 (rank 0) ends up with different weight values than the other ranks during training, for some reason.

It works as expected after loading because all devices load the same weights and don’t modify them any further.
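
One way to check whether the ranks really hold different values is to compare a cheap checksum of the model state on each rank right before evaluation. The following is only a rough sketch: `model_fingerprint` and `report_fingerprints` are hypothetical helper names, and including buffers (such as BatchNorm running statistics) in the checksum is my assumption about where a mismatch could hide, not something confirmed for this case.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def model_fingerprint(model: nn.Module) -> float:
    """Sum every value in the state dict as a coarse per-rank checksum."""
    total = 0.0
    # state_dict() covers parameters and persistent buffers.
    for tensor in model.state_dict().values():
        total += tensor.detach().double().sum().item()
    return total


def report_fingerprints(model: nn.Module) -> list:
    """Return one fingerprint per rank (a single entry outside distributed runs)."""
    local = model_fingerprint(model)
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        # With the NCCL backend, call torch.cuda.set_device(...) for this
        # process before using all_gather_object.
        dist.all_gather_object(gathered, local)
        return gathered
    return [local]
```

Calling `report_fingerprints(model)` on every rank just before the evaluation loop and printing the result should show whether rank 0 stands out. A plain sum can miss differences that happen to cancel, so treat matching values as a hint rather than proof.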

@pritamdamania87 @Yanli_Zhao Do you have any idea why rank 0 would give different results in this case?