I’ve seen some discussions about DDP vs. DP here, but mainly focused on the learning rate. In my case both are taking the mean of the gradients across GPUs, yet I am consistently seeing somewhat worse performance in terms of loss and additional metrics from DDP than from DP. I am using the same number of GPUs, the same batch size, the same CrossEntropyLoss, and all other hyperparameters are kept the same as well. The sampler’s epoch is also set at the start of every new epoch in the DDP runs.
This is a GPT-2-like language model, so batch norm cannot be the answer. My understanding was that if the gradients are averaged across GPUs in both cases and there is no batch norm, the two methods should give consistent results, with DDP simply being much faster. Obviously my understanding is wrong somewhere; I just haven’t found any explanation of why this wouldn’t be the case.
Additionally, the seed is set for model initialization, and the gap has held up across many runs as well.
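For reference, the per-epoch bookkeeping mentioned above looks roughly like this (toy dataset and sizes for illustration; in a real run, num_replicas and rank would come from the initialized process group):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy stand-in for the real dataset
dataset = TensorDataset(torch.randn(64, 8))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # without this, every epoch replays the same shuffle
    for (batch,) in loader:
        pass  # forward/backward/step would go here
```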
I’m assuming you mean “DDP is just much slower” here? Do you have some sample code illustrating this performance gap? If so, it will be much easier for us to troubleshoot why DDP might be much slower than DP in certain use cases.
Unfortunately I don’t really have sample code. What I mean is that I am getting consistently slightly worse results in terms of loss, not speed.
Ah, my bad, I misread the original question. Could you confirm that the only change between the two runs is that in one case you wrap the model with DDP() and in the other with DP()? Are you using a single process per GPU for DDP, or one process driving all the GPUs? Also, are you using GLOO or NCCL as the process-group backend?
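For concreteness, the distinction being asked about is roughly the following sketch (the world_size=1 Gloo group is only there to make the snippet self-contained; a real DDP run would typically use torchrun with one process per GPU):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DataParallel, DistributedDataParallel

model = nn.Linear(8, 2)

# DP: a single process drives all visible GPUs; no process group needed
dp_model = DataParallel(model)

# DDP: requires an initialized process group; typically one process per GPU
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)
ddp_model = DistributedDataParallel(model)
dist.destroy_process_group()
```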
As for the loss discrepancy between DDP and DP: does it appear from the very first iteration, or does it slowly creep in, with deviations showing up only after a large number of iterations? Also, what is the percentage difference between the DDP and DP losses at the end of training?
It is more that it converges to a slightly higher level: the best eval loss is around 0.17 for DP and 0.175 for DDP. Although it is a small difference, it is a meaningful one; it affects downstream performance and is consistent across runs.
Unfortunately it isn’t, and I understand that the direct problem is probably not solvable because of that. I was more looking for possible reasons why this would occur, as in my mind the results should be identical.
I think the difference in results between DDP and DP might have to do with the fact that DP computes the loss and gradients on the entire batch, whereas DDP computes the loss and gradients on individual minibatches and then averages the gradients. As a result, if there is some computation where f(x + y) != f(x) + f(y), DDP can produce different results.
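The example code from this post isn’t reproduced here, but a minimal sketch of the kind of comparison being made might look like this (loss_fn, the 4-way chunking, and the grad1/grad2 names are illustrative):

```python
import torch

def loss_fn(x):
    return (x ** 2).mean()  # stand-in for a loss with mean reduction

torch.manual_seed(0)
data = torch.randn(4, 8)

# "DP-style": one loss over the full batch
input = data.clone().requires_grad_(True)
loss_fn(input).backward()
grad1 = input.grad.clone()

# "DDP-style": per-replica losses on 4 chunks, backpropagated together
input = data.clone().requires_grad_(True)
sum(loss_fn(chunk) for chunk in input.chunk(4)).backward()
grad2 = input.grad
print(torch.allclose(grad1, grad2))  # False here: grad2 == 4 * grad1
```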
Hi, thank you for the response and the code. If you change the third-to-last and second-to-last lines to:
grad2 = input.grad/4
you will get equal gradients. So this suggests it’s a scaling issue, and tuning the learning rate should solve it, rather than a more fundamental issue where the mean of the minibatch losses doesn’t equal the full-batch loss.
My example above is very simple, so such scaling can resolve the issue there. For a complex model, however, it might not be clear how to scale the gradients or learning rate, since the model can contain a large number of complex functions of the form f(x + y) != f(x) + f(y) that produce this inconsistency.
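To illustrate that point with a made-up nonlinear reduction (f here is purely for demonstration): no single constant relates the full-batch and chunked results, so no fixed rescaling can reconcile them.

```python
import torch

torch.manual_seed(0)

def f(t):
    return t.pow(2).sum().sqrt()  # nonlinear in the batch: f(x + y) != f(x) + f(y)

for _ in range(3):
    x = torch.randn(16)
    full = f(x)
    chunked = sum(f(c) for c in x.chunk(4)) / 4
    print((full / chunked).item())  # the ratio varies with the data, so no fixed rescale fixes it
```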
Hi, the issue here turned out to be with padding, so apologies for the misleading original post. I thought we had investigated it fully, but we had not. I’ve added more of a description with some possibly helpful notes here:
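For anyone hitting the same thing, here is a minimal illustration (with made-up numbers) of how padding can cause a DP/DDP gap: if the loss is averaged over non-pad tokens, averaging per-rank means weights every rank equally regardless of how many real tokens it holds, which differs from the global per-token mean a single-process or DP run computes.

```python
import torch

# Per-token losses on two ranks with different numbers of real (non-pad) tokens
rank0 = torch.tensor([1.0, 1.0, 1.0, 1.0])  # 4 real tokens
rank1 = torch.tensor([3.0])                 # 1 real token, rest was padding

global_mean = torch.cat([rank0, rank1]).mean()  # 7/5 = 1.4 (single process / DP)
ddp_mean = (rank0.mean() + rank1.mean()) / 2    # (1 + 3)/2 = 2.0 (average of per-rank means)
print(global_mean.item(), ddp_mean.item())
```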
Not sure this is the case for anyone here, but in my case I was using autocast and GradScaler, with both set to enabled=False. According to the docs, this should mean they have no effect, which was in fact the case with a single GPU and with DP.
However, with DDP I found that introducing them significantly increased the variance of the training and validation loss, deteriorating model accuracy overall. According to the docs, autocast and GradScaler shouldn’t adversely affect DDP, but they did exactly that in my case. I’m not sure why, but I assume it has to do with gradient synchronization in DDP.
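For concreteness, the pattern in question was roughly the following (toy model and loop for illustration; per the docs, both wrappers should be pass-throughs when enabled=False):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # documented as a no-op when disabled

for _ in range(3):
    x = torch.randn(4, 8)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=False):   # likewise documented as a pass-through
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale() returns loss unscaled when disabled
    scaler.step(optimizer)         # falls through to optimizer.step()
    scaler.update()
```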