Why might DDP perform worse than DP?

Hi,

I’ve seen some discussions about DDP vs DP here, but they mainly focus on the learning rate. In my case both are taking the mean of the gradients across GPUs, yet I am consistently seeing somewhat worse performance from DDP than from DP, in terms of both loss and additional metrics. I am using the same number of GPUs, the same total batch size, the same CrossEntropyLoss, and all other hyperparameters are kept the same as well. The epoch is also set at the start of every new epoch in the DDP run.
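
(By "the epoch is set" I mean the usual DistributedSampler pattern, roughly like this sketch with a stand-in dataset, assuming the process group is already initialized:)

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 10))     # stand-in dataset
sampler = DistributedSampler(dataset)              # each DDP process gets its own shard
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

num_epochs = 3
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # reshuffles the shards differently each epoch
    for (batch,) in loader:
        ...                    # forward / backward / step as usual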

This is on a GPT-2-like language model, so batch norm cannot be the explanation. My understanding is that if the gradients are averaged across GPUs in both cases and there is no batch norm, the two methods should give consistent results, with DDP just being much faster. Obviously my understanding is wrong somewhere, but I haven’t found an explanation of why this wouldn’t be the case.
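
To make my mental model concrete, here is the kind of toy check I would expect to hold (plain linear model, no batch norm; illustrative code only, not taken from my training script): averaging the two half-batch gradients reproduces the full-batch gradient.

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 5)            # no batch norm, so per-sample grads are independent
loss_fn = nn.CrossEntropyLoss()     # default reduction="mean"

x = torch.randn(8, 10)
y = torch.randint(0, 5, (8,))

# "DP-style": one loss over the full batch of 8
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# "DDP-style": a loss per half-batch, then average the two gradients
model.zero_grad()
loss_fn(model(x[:4]), y[:4]).backward()
g1 = model.weight.grad.clone()
model.zero_grad()
loss_fn(model(x[4:]), y[4:]).backward()
g2 = model.weight.grad.clone()

print(torch.allclose(full_grad, (g1 + g2) / 2))   # True, up to floating-point error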

Additionally, the seed is set for model initialization, and I have done many runs as well.

Thanks!

I’m assuming you mean “DDP is just much slower” here? Do you have some sample code illustrating this performance gap? If so, it is much easier for us to troubleshoot why DDP might be much slower than DP in certain use cases.

Unfortunately I don’t really have sample code. What I mean is that I am consistently getting slightly worse results in terms of loss. DDP is much faster than DP, which I believe it is supposed to be.

Ah my bad, sorry I misread the original question. I’m wondering if you could confirm that the only change between the two runs is that in one case we wrap the model with DDP() and in the other with DP(). Are you using a single process per GPU for DDP or one process driving all GPUs? Also, are you using GLOO or NCCL for the process group backend?

Hi, yes. DDP is being launched using the “python -m torch.distributed.launch script.py” utility, and yes, the only difference is the wrapping.

DDP is running a single process per GPU
DP is running a single process total
NCCL is the backend I am using, which I believe is the one recommended by the PyTorch docs.
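
In rough outline the two setups look like this (a sketch with a stand-in model, not my actual script):

import os
import torch
import torch.distributed as dist
from torch.nn import Linear
from torch.nn.parallel import DataParallel, DistributedDataParallel

model = Linear(10, 10)   # stand-in for the GPT-2-like model

# DP: a single process driving all visible GPUs
# model = DataParallel(model.cuda())

# DDP: one process per GPU under torch.distributed.launch; the launcher provides
# the local rank (via --local_rank or the LOCAL_RANK env var depending on the
# PyTorch version; read from the env here for brevity)
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
model = DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])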

I see, could you also confirm that the batch size for DP is equal to the sum of the batch sizes fed to the DDP processes? For DDP, the effective batch size is the global batch size across all DDP workers.
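
For example, with illustrative numbers only:

# With 4 GPUs, these two configurations see the same effective batch per optimizer step.
world_size = 4
dp_batch_size = 64                                     # DataLoader batch size in the DP run
ddp_per_process_batch = dp_batch_size // world_size    # 16 per DDP process
assert ddp_per_process_batch * world_size == dp_batch_size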

Yes, that is the case, the total batch size is equivalent for both.

In terms of the discrepancy in loss between DDP and DP, does this happen from the first iteration itself, or does it slowly creep in, with deviations appearing only after a large number of iterations? Also, what is the percentage difference between the loss you see for DDP and DP at the end of training?

It is more that it converges to a slightly higher level. The best eval loss would be something like 0.17 for DP and 0.175 for DDP. Although it is a small difference, it is a meaningful one: it affects downstream performance and is consistent across runs.

I see, it is hard to track down the root cause without having a repro that we can run ourselves. Is it possible to share your code and the steps to reproduce this issue if your code is OSS?

Unfortunately it isn’t. I understand that the specific problem probably can’t be solved directly because of that; I was more looking for possible reasons why this would occur, since in my mind the results should be identical.

Are you seeing this worse convergence across different runs with different seeds (if seeding is used), i.e. is it reproducible?

I also ran into this problem and tried almost every solution I’ve seen on this forum. Have you solved it yet?

I think the difference in results between DDP and DP might have to do with the fact that DP computes the loss and grads on the entire batch, whereas DDP computes loss and grads on individual minibatches and then averages the grads. As a result, if there is some computation where f(x + y) != f(x) + f(y), DDP might provide different results.

A simple example simulating this:

import torch
from torch import nn
from torch.autograd import Function

# A custom autograd Function whose backward is non-linear in the incoming
# gradient (it squares it), so splitting the batch changes the result.
class MyFunc(Function):
    @staticmethod
    def forward(ctx, inp):
        return inp

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * grad_output

loss = nn.CrossEntropyLoss()
input = torch.randn(4, 5, requires_grad=True)
target = torch.empty(4, dtype=torch.long).random_(5)

# Simulate DP: one loss over the full batch of 4
b1 = input[0:2]
b2 = input[2:4]
b1 = MyFunc.apply(b1)
b2 = MyFunc.apply(b2)
output = loss(torch.cat((b1, b2), 0), target)
output.backward()
print(input.grad)
grad1 = input.grad
input.grad = None

# Simulate DDP: a loss per half-batch, grads accumulated and then averaged
b1 = input[0:2]
b2 = input[2:4]
t1 = target[0:2]
t2 = target[2:4]
b1 = MyFunc.apply(b1)
output = loss(b1, t1)
output.backward()
b2 = MyFunc.apply(b2)
output = loss(b2, t2)
output.backward()
print(input.grad / 2)
grad2 = input.grad / 2
print(grad1 == grad2)

Hi, thank you for the response and the code. If you change the third-to-last and second-to-last lines to:

print(input.grad / 4)
grad2 = input.grad / 4

you get equal gradients. So this suggests it’s a scaling issue that tuning the learning rate should solve, rather than a more fundamental issue where the mean of the minibatch losses doesn’t equal the loss over the whole batch.
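
(If I’m tracing MyFunc.backward correctly, the factor of 4 comes from the mean reduction: in the full-batch case cross-entropy passes a per-sample gradient of g/4 into MyFunc.backward, which squares it to g^2/16; in the split case each half is averaged over 2 samples, so the backward sees g/2 and returns g^2/4. Dividing the accumulated split gradients by 4 therefore recovers g^2/16, matching the full-batch result.)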

My example above is very simple, and as a result such scaling resolves the inconsistency. For a complex model, however, it might not be clear how to scale the grads or the learning rate, since the model could contain a large number of complex functions of the form f(x + y) != f(x) + f(y) that produce this inconsistency.

Ah, great that makes sense, thank you for the clarification. I’m gonna do some gradient testing now. Thank you!

Hi, the issue here turned out to be with padding, so apologies for the misleading original post. I thought we had investigated it fully, but we had not. I’ve added more of a description with some possibly helpful notes here:


Not sure this is the case for anyone else here, but in my case I was using autocast and GradScaler, with both set to enabled=False. According to the docs this means they should have no effect, which was indeed the case with a single GPU and with DP.

However, with DDP I found that introducing them significantly increased the variance of the training and validation loss, hurting model accuracy overall. According to the docs, autocast and GradScaler shouldn’t adversely affect DDP, but they did exactly that in my case. I’m not sure why, but I assume it has to do with gradient synchronization in DDP.
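
For context, the training step looked roughly like this (a minimal sketch with illustrative names, not my exact loop):

import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = False                      # both wrappers are documented as no-ops when disabled
scaler = GradScaler(enabled=use_amp)

def train_step(ddp_model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with autocast(enabled=use_amp):
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()    # with enabled=False this is just loss.backward()
    scaler.step(optimizer)           # with enabled=False this is just optimizer.step()
    scaler.update()                  # no-op when disabled
    return loss.detach()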