Maybe I was not clear. Let n1 be the number of GPUs in case 1 and n2 the number of GPUs in case 2. In case 1 the batch size per GPU is (dataset size)/n1, i.e. the total batch size equals the dataset size; similarly for case 2. The gist is that the total (effective) batch size is the same in both cases, yet the results differ. Are you saying the results will be different in these two cases? If so, how can I make the results reproducible across a varying number of GPUs? A small sketch of the setup is below.
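```python
# Illustrative sketch of the setup described above (numbers are hypothetical):
# the effective (global) batch size equals the dataset size in both cases,
# but the local (per-GPU) batch size changes with the number of GPUs.
dataset_size = 1024

for n_gpus in (2, 8):  # e.g. case 1 and case 2
    local_batch_size = dataset_size // n_gpus
    effective_batch_size = local_batch_size * n_gpus
    print(f"{n_gpus} GPUs -> local batch size {local_batch_size}, "
          f"effective batch size {effective_batch_size}")
```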
Yes, since the local batch size differs between your setups, different algorithms can be selected in the math libraries. This would not produce bitwise-identical outputs between the two setups, but each setup should still be deterministic if it is executed repeatedly.
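A minimal sketch of the flags commonly used to get run-to-run determinism for a *fixed* setup (the helper name `seed_everything` is just illustrative); note that this does not make results match across different GPU counts, since the local batch size still differs:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 0) -> None:
    # Seed all RNGs used during training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


seed_everything(0)

# Force deterministic kernels where available; ops without a deterministic
# implementation will raise an error instead of silently being non-deterministic.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

# Needed for deterministic cuBLAS matmuls (set before the first CUDA call).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```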