Weird behavior while evaluating using DDP

eslam_bakr · April 8, 2022, 4:31pm

Hello Everyone,

I am training with DDP on 4 V100 GPUs. I use a distributed sampler for training and set the sampler to None for testing.
As I understand it, setting the sampler to None during evaluation means the data is not split into chunks, so all four GPUs run over the same data and should produce the same results.
But when I evaluate after each epoch, the four GPUs report different accuracies.
Specifically, GPU 0 reports a different accuracy, while the other three GPUs all report the same one.
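
To make that expectation concrete, here is a minimal pure-Python sketch of which sample indices each rank would see. It imitates the non-shuffled partitioning of torch.utils.data.distributed.DistributedSampler (pad the index list to a multiple of the world size, then take every world_size-th index starting at the rank). The helper names distributed_indices and unsharded_indices are hypothetical and not part of PyTorch, and the dataset length of 10 is only an illustration.

```python
import math


def distributed_indices(dataset_len, world_size, rank):
    """Indices one rank sees with a non-shuffled distributed sampler.

    Hypothetical helper: pads the index list by wrapping around until it
    divides evenly by world_size, then strides through it by world_size.
    """
    indices = list(range(dataset_len))
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    # Repeat indices from the start so every rank gets the same amount of work.
    padding = total_size - len(indices)
    indices += indices[:padding]
    return indices[rank:total_size:world_size]


def unsharded_indices(dataset_len):
    """Indices every rank sees when sampler=None and shuffle=False."""
    return list(range(dataset_len))


world_size = 4
# Training: each rank gets a different slice of the dataset.
train_shards = [distributed_indices(10, world_size, r) for r in range(world_size)]
# Evaluation with sampler=None: every rank iterates the full dataset.
eval_views = [unsharded_indices(10) for _ in range(world_size)]

for rank in range(world_size):
    print(rank, train_shards[rank], eval_views[rank])
```

Since every rank sees identical evaluation inputs in this setup, a difference in accuracy between ranks would most likely come from the model state held by each rank rather than from the data split.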

After training finishes, when I load the saved weights and evaluate with exactly the same function, I get the same result on all 4 GPUs.

So my question is: why does GPU 0 give a different accuracy when evaluating during training, even though I am using exactly the same function?

Thanks in advance.

kumpera · April 11, 2022, 3:34pm

Hey,

For some reason, GPU 0 has different values for weights than the other ranks during training.

It works as you expected after loading because all devices load the same weights and don’t modify them any further, giving the results you expect.

@pritamdamania87 @Yanli_Zhao Do you have any idea why rank 0 would give different results in this case?
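
One way to test whether rank 0 really holds different weights is to compare a short fingerprint of the model state from every rank. The sketch below is pure Python and uses toy byte strings in place of real tensors; fingerprint and divergent_ranks are hypothetical helpers, not PyTorch APIs. In an actual job, one assumption would be to build the name-to-bytes mapping from model.state_dict() (which covers both parameters and buffers) and to collect the digests from all ranks with torch.distributed.all_gather_object.

```python
import hashlib
from collections import Counter


def fingerprint(state_bytes):
    """Hash a mapping of tensor name -> raw bytes into one short digest."""
    digest = hashlib.sha256()
    for name in sorted(state_bytes):
        digest.update(name.encode("utf-8"))
        digest.update(state_bytes[name])
    return digest.hexdigest()[:16]


def divergent_ranks(fingerprints):
    """Return the ranks whose fingerprint differs from the most common one."""
    majority, _ = Counter(fingerprints).most_common(1)[0]
    return [rank for rank, fp in enumerate(fingerprints) if fp != majority]


# Toy stand-in for four ranks (made-up names and bytes): rank 0 holds a
# slightly different value for one entry, the other three ranks agree.
shared = {"layer.weight": b"\x00\x01\x02", "layer.bias": b"\x10\x11"}
rank0 = {"layer.weight": b"\x00\x01\x02", "layer.bias": b"\x10\x12"}
per_rank = [fingerprint(rank0)] + [fingerprint(shared) for _ in range(3)]

print(divergent_ranks(per_rank))
```

Running such a check right before the per-epoch evaluation would show whether the mismatch is already present in the state at that point.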

hendrikl · July 15, 2022, 8:12pm

Hi @eslam_bakr, I ran into a similar issue. Did you ever find out why this happens? In my case I get 4 completely different evaluation results… How do you ‘terminate’ after training?

eslam_bakr · July 24, 2022, 6:39pm

I am not sure what the root cause is, but as a workaround I run evaluation on only one GPU, using the following code:

# Evaluate when not distributed, or only on the first GPU of each node.
if not args.multiprocessing_distributed or (
        args.multiprocessing_distributed and args.rank % ngpus_per_node == 0):
    eval_function()
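
The condition above can be checked in isolation. The following is a minimal sketch that pulls it into a hypothetical helper, should_evaluate, with the same truth table, and lists which ranks would evaluate for a given layout. The two-node layout is an assumed example, not something from the original setup.

```python
def should_evaluate(multiprocessing_distributed, rank, ngpus_per_node):
    """True on processes that should run evaluation.

    Same truth table as the condition in the post: always evaluate when not
    running distributed, otherwise only on the first GPU of each node.
    """
    return (not multiprocessing_distributed) or rank % ngpus_per_node == 0


# Single node with 4 GPUs, as in the original setup: only rank 0 evaluates.
evaluating = [r for r in range(4) if should_evaluate(True, r, 4)]

# Assumed two-node layout with 4 GPUs each: ranks 0 and 4 both evaluate.
evaluating_two_nodes = [r for r in range(8) if should_evaluate(True, r, 4)]

print(evaluating, evaluating_two_nodes)
```

Note that rank % ngpus_per_node == 0 selects the first GPU on every node, so on a multi-node job more than one process evaluates; comparing against rank 0 directly would restrict it to a single process. While one rank evaluates, the other ranks wait at their next collective call; a torch.distributed.barrier() after evaluation is one common way to make that wait explicit, though whether it is needed depends on the training loop.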