Different number of GPUs with DDP give different results

Consider following cases:

  1. I train my network with 2 GPUs, with batch size per GPU = dataset size / 2.
  2. I train my network with 4 GPUs, with batch size per GPU = dataset size / 4.
    Assume the dataset size is divisible by 4.

The results I get in each case are different, even though I fix all random seeds.
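For reference, here is a minimal sketch of the setup I have in mind (the toy model, dataset, and `torchrun` launch are just placeholders, not my actual code):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE
    dist.init_process_group(backend="nccl")
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # fix all random seeds
    torch.manual_seed(0)

    # placeholder dataset; size divisible by 4
    dataset_size = 64
    dataset = TensorDataset(torch.randn(dataset_size, 10),
                            torch.randn(dataset_size, 1))

    # per-GPU batch size = dataset size / number of GPUs,
    # so the effective (global) batch size equals the dataset size
    per_gpu_batch_size = dataset_size // world_size
    sampler = DistributedSampler(dataset, shuffle=False)
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    if dist.get_rank() == 0:
        print(loss.item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launching this with `torchrun --nproc_per_node=2 train.py` vs. `torchrun --nproc_per_node=4 train.py` gives the same effective batch size (the whole dataset) in both cases, but the results I see differ between the two runs.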

This is expected, as PyTorch does not guarantee bitwise-identical, deterministic results across different setups.

Maybe I didn't make myself clear. Let n1 be the number of GPUs in case 1 and n2 the number of GPUs in case 2. In case 1, the batch size per GPU is (dataset size)/n1, i.e. the total batch size is the dataset size. The same holds for case 2. The gist is that the total (effective) batch size is the same in both cases, yet the results differ. Are you saying the results will differ between these two cases? If so, how can I make them match across varying numbers of GPUs?

Thanks for your reply.

Yes. Since the local batch size differs between your setups, the math libraries can select different algorithms, so the outputs will not be bitwise identical across the setups. The results should still be deterministic, however, if the same setup is executed repeatedly.
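If you want run-to-run reproducibility within a fixed setup, a common pattern is something like the sketch below (the seed value is arbitrary, and the cuBLAS env var applies to CUDA 10.2+ and should be set before any CUDA work):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    # seed Python, NumPy, and PyTorch (CPU and all GPUs)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # prefer deterministic algorithms and raise an error
    # if an op has no deterministic implementation
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

    # needed for deterministic cuBLAS behavior on CUDA 10.2+
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```

Note that this only addresses reproducibility for a fixed setup; it won't make the 2-GPU and 4-GPU runs match each other.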