Training performance degrades with DistributedDataParallel

I'm having the same issue (DP gives much better validation metrics than DDP). Setting

torch.backends.cudnn.enabled = False

slows my runtime down by 3x.

Monkey patching torch.nn.functional.batch_norm like this:

import torch

def monkey_patch_bn():
    # Replacement for torch.nn.functional.batch_norm that passes
    # cudnn_enabled=False to torch.batch_norm, so cuDNN is bypassed for
    # batch norm only instead of being disabled globally.
    def batch_norm(input, running_mean, running_var, weight=None, bias=None,
                   training=False, momentum=0.1, eps=1e-5):
        if training:
            # Same sanity check as the original implementation: batch norm
            # needs more than one value per channel to compute statistics.
            size = input.size()
            size_prods = size[0]
            for i in range(len(size) - 2):
                size_prods *= size[i + 2]
            if size_prods == 1:
                raise ValueError(
                    'Expected more than 1 value per channel when training, '
                    'got input size {}'.format(size))

        # The last argument is cudnn_enabled; False forces the native kernel.
        return torch.batch_norm(
            input, weight, bias, running_mean, running_var,
            training, momentum, eps, False
        )

    torch.nn.functional.batch_norm = batch_norm

doesn’t seem to help.

train_dataloader.sampler.set_epoch(epoch) doesn’t seem to help either.
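
For reference, here is a minimal sketch of the setup I mean, assuming torch.distributed is already initialized (train_dataset and num_epochs are placeholders):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset)      # train_dataset is a placeholder
train_dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):                  # num_epochs is assumed
    # Reshuffles deterministically per epoch so each rank sees a different shard.
    train_dataloader.sampler.set_epoch(epoch)
    for batch in train_dataloader:
        ...                                      # training step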

EDIT:

What does seem to work is dividing my lr by my world_size, although I'm not sure why.
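
For what it's worth, a minimal sketch of that adjustment, assuming the process group is already initialized and model is built (base_lr is hypothetical):

import torch
import torch.distributed as dist

base_lr = 0.1                           # lr that worked for the DP run (hypothetical)
lr = base_lr / dist.get_world_size()    # divide by the number of DDP processes
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)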

@YueshangGu Hi, how do you use DataParallel + SyncBN at the same time? I thought SyncBN only works with DistributedDataParallel.

Another issue I spotted in my case: the model has to be moved to the proper device before it is wrapped in DDP.
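
A minimal sketch of that ordering, assuming one process per GPU and a local_rank supplied by the launcher (MyModel is a placeholder):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

device = torch.device('cuda', local_rank)      # local_rank comes from the launcher (assumed)
model = MyModel()                              # placeholder network
model = model.to(device)                       # move to the target GPU first ...
model = DDP(model, device_ids=[local_rank])    # ... then wrap in DDP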

Our experience with Kinetics 400 using PyTorch 1.3 on a node with two GPUs is as follows:

Single GPU > DP (-0.2%) > DP w/ sync BN (-0.3%)

Single GPU serves as the baseline for DP and DP w/ sync BN.
The tradeoff with distributed training is understandable, but sync BN causing worse accuracy is hard to ignore.

My setting is the same as yours, just tested on HMDB51, and I see the same ordering:
DP > DP w/ sync BN. Have you found a solution to this issue?

Hi guys, where can I find the code for SyncBN?

Maybe the learning rate is the problem?

Maybe you are right. When using DDP + SyncBN, batch norm statistics are computed over a larger effective batch. The learning rate should be tuned a bit higher (original_lr * num_gpus).
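
A minimal sketch of that combination, assuming the process group is already initialized (MyModel, device, local_rank, and base_lr are placeholders):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

model = MyModel().to(device)                                   # placeholder model on its GPU
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)   # swap BN layers for SyncBN
model = DDP(model, device_ids=[local_rank])

base_lr = 0.01                          # lr tuned for the single-GPU run (hypothetical)
lr = base_lr * dist.get_world_size()    # linear scaling for the larger effective batch
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)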

Strangely, using DistributedSampler degrades performance in my case.

I’m not sure about the effect of DistributedSampler in DistributedDataParallel.

Can you tell me how this monkey patching is done? Which file does it go in?

Hi, do you disable cuDNN for the whole project, or just in the batch norm code?

The main reason may be that DDP estimates the global variance (lr is another reason, but accuracy still drops by about 0.2%); see https://github.com/pytorch/pytorch/pull/14267#issuecomment-449125620

The source code is pretty straightforward.

Hi.

I got the same problem.
Updating PyTorch to version 1.6.0 didn't help, although it seems they fixed several things in SyncBN.

Did anybody get an improvement?

Hi @TT_YY, have you tried setting torch.backends.cudnn.enabled = False?

Hi @rvarm1

Thank you for your response, and sorry for my delayed reply.
I will try it.

Thanks.

Did you ever resolve this issue? Using DDP + SyncBN does not help.

Not sure this is the case for you, but I was using autocast and GradScaler with both set to enabled=False. According to the docs this should mean they have no effect, which was indeed the case with a single GPU and with DP.

However, with DDP I found that introducing these significantly increased the variance of the training and validation loss, deteriorating model accuracy overall. According to the docs, autocast and GradScaler shouldn't adversely affect DDP, but they did just that in my case. Not sure why, but I assume it has to do with gradient synchronization in DDP.
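
For context, this is the pattern I mean, with use_amp as a hypothetical flag (model, criterion, optimizer, and train_loader are assumed to exist); enabled=False is supposed to turn both into no-ops:

import torch
from torch.cuda.amp import autocast, GradScaler

use_amp = False                        # hypothetical flag; False should make AMP a no-op
scaler = GradScaler(enabled=use_amp)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    with autocast(enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scale() passes the loss through when disabled
    scaler.step(optimizer)
    scaler.update()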

Do you use loss = losses.sum()? DDP averages gradients over all ranks, which assumes the loss is an average; if a summed loss is used, the gradient scale comes out wrong.
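
In other words, something like the sketch below, where the per-sample losses are reduced with a mean so the gradient scale matches DDP's averaging across ranks (outputs and targets are placeholders):

import torch.nn.functional as F

losses = F.cross_entropy(outputs, targets, reduction='none')   # per-sample losses
loss = losses.mean()                                           # mean instead of losses.sum()
loss.backward()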

Hi, I got the same issue. Did you solve the problem?