Training performance degrades with DistributedDataParallel

I’m training a convolutional model in both DataParallel (DP) and DistributedDataParallel (DDP) modes. For DDP, I use it only on a single node, with one process per GPU.
My model has many BatchNorm2d layers. With everything else held the same, I observe that DP trains better than DDP (in classification accuracy). Even when I add SyncBN (from PyTorch 1.1), I still observe DP > DDP+SyncBN > DDP without SyncBN in test accuracy.

I’m aware of the difference between how DP and DDP handle gradient averaging vs. summing: Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
The LR and total batch size are the same across DP, DDP+SyncBN, and DDP.

If I understand correctly, DP doesn’t synchronize BN across GPUs either, so DP should in theory achieve the same test accuracy as plain DDP (given the same small batch size per GPU). If we assume a larger effective BN batch size leads to better results, I should expect the following test-accuracy ranking:

DDP+SyncBN > DP == DDP

but in practice, I observe: DP > DDP+SyncBN > DDP

Because DDP+SyncBN is 30% faster than DP, I would really like to close this training gap so that I can take advantage of DDP’s superior speed. Thanks for any help!
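For reference, here is a minimal sketch of what I mean by the per-process DDP+SyncBN setup (the function name, the env-based rendezvous, and the NCCL choice are illustrative, not my exact script):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp_syncbn(model, rank, world_size):
    # One process per GPU, NCCL backend on a single node.
    # Assumes MASTER_ADDR / MASTER_PORT are set in the environment.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # The "DDP+SyncBN" variant: replace every BatchNorm2d with SyncBatchNorm.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(rank)
    return DDP(model, device_ids=[rank])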

Hi Jim,
From the docs:
DistributedDataParallel can be used in the following two ways:
(1) Single-Process Multi-GPU
(2) Multi-Process Single-GPU
The second method is the highly recommended way to use DistributedDataParallel: multiple processes, each of which operates on a single GPU. This is currently the fastest approach to data parallel training in PyTorch and applies to both single-node (multi-GPU) and multi-node data parallel training.

Which one are you using?

I’m using (2). More specifically, I have one node with 8 GPUs, and I launch DDP with 8 separate processes, each of which owns a single GPU.
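Roughly, the launch looks like this (a sketch only; the worker body and names are placeholders):

import torch
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process drives exactly one GPU; process-group setup
    # and the training loop are omitted here.
    torch.cuda.set_device(rank)

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 8 on my node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)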

I do not have a good guess then :confused:

Do you have any CPU-heavy pre-processing? Are you reporting only GPU performance?
Which backend are you using, NCCL?

By “performance”, I mean classification accuracy. Somehow DDP+SyncBN achieves worse test accuracy than DP, so there must be some problematic difference in the numerics. Speed isn’t the issue here. Thanks!

My mistake, I got it wrong. Thanks for the clarification.

I can only comment on the differences between DP and DDP w.r.t. batch normalization. With DP, your module is replicated before each call to forward, which means that only the BN stats from the first replica are kept around. With DDP, each process keeps its own copy of the BN stats. And with SyncBN you’ll end up with stats that are “more averaged” than the stats kept when using DP, because the DP stats only cover the batch chunk seen by a single replica, whereas SyncBN’s stats pool the chunks from all replicas.
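To make the “more averaged” point concrete, here is a toy single-process illustration (shapes and the number of chunks are made up): plain BN on each replica only sees its own shard of the batch, while SyncBN effectively pools statistics over all shards.

import torch

x = torch.randn(32, 3, 8, 8)  # the full batch, as DP sees it on entry
chunks = x.chunk(4, dim=0)    # the per-GPU shards, as 4 DDP processes would see them

# Per-channel batch mean that plain BN computes on each replica (DP keeps only
# the first replica's running stats; DDP without SyncBN keeps one independent
# copy per process).
per_replica_mean = torch.stack([c.mean(dim=(0, 2, 3)) for c in chunks])

# Per-channel mean that SyncBN computes, pooled over all replicas.
global_mean = x.mean(dim=(0, 2, 3))

print(per_replica_mean)  # noisier, one row per "replica"
print(global_mean)       # the "more averaged" statistic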

I also encountered this performance issue with DistributedDataParallel; I hope someone can offer a solution. :slightly_frowning_face:

I found the problem in my code: it’s caused by the cuDNN batch norm. According to this GitHub issue, the solution is either to edit the batch norm part of torch/nn/functional.py or to set torch.backends.cudnn.enabled = False.
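For the second workaround, the flag just needs to be set early in the script, before any forward passes run. Note that it is a global switch, so it also disables cuDNN for convolutions (which is why it costs speed, as reported below):

import torch

# Fall back from the cuDNN batch norm kernels to the native implementation.
torch.backends.cudnn.enabled = False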

Could editing the batch norm part in torch/nn/functional.py work for sync BN?


The batch norm in torch.nn.functional is used just for evaluation. I think editing it would do nothing for sync batch norm. How do you edit the file to make sync BN work normally?

You are right: although the performance improves after disabling cuDNN, the gap still remains. I can’t figure out the problem, so for now I have to use nn.DataParallel :slightly_frowning_face:.

@Mr.Z Did you find the problem? I also get much worse accuracy when using SyncBN + DDP with batch size 16 (4 GPUs on one node, 4 images per GPU), whereas when I use DataParallel + SyncBN, everything is OK.

Same here. The performance of a DDP model is weaker than one trained on a single GPU. Playing with LR/batch size does not help. As the number of GPUs in DDP training grows, performance degrades.

Has anyone found a solution?

UPDATE: the reason was found for my case. When training a DDP model, we need to use a DistributedSampler, which is passed to the DataLoader, and we need to call train_dataloader.sampler.set_epoch(epoch) at the start of every epoch.
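A minimal sketch of that setup (the dummy dataset, batch size, and epoch count are placeholders, and it assumes the process group has already been initialized):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; use your real one.
train_dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                              torch.randint(0, 10, (1024,)))

sampler = DistributedSampler(train_dataset)     # shards the data across processes
train_dataloader = DataLoader(train_dataset,
                              batch_size=32,
                              sampler=sampler,  # do not also pass shuffle=True
                              num_workers=4)

for epoch in range(90):
    # Without this call, every epoch reuses the same shuffling order in every process.
    train_dataloader.sampler.set_epoch(epoch)
    for images, targets in train_dataloader:
        pass  # forward / backward / optimizer step as usual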

Having the same issue (DP gives much better validation metrics than DDP). Setting

torch.backends.cudnn.enabled = False

slows my runtime down by 3x.

Monkey patching torch.nn.functional.batch_norm:

import torch


def monkey_patch_bn():
    # Re-implementation of torch.nn.functional.batch_norm with cuDNN forced
    # off for this op only (the last argument of torch.batch_norm below).
    def batch_norm(input, running_mean, running_var, weight=None, bias=None,
                   training=False, momentum=0.1, eps=1e-5):
        if training:
            # Mirror the original check: BN needs more than one value per
            # channel when training.
            size = input.size()
            size_prods = size[0]
            for i in range(len(size) - 2):
                size_prods *= size[i + 2]
            if size_prods == 1:
                raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))

        return torch.batch_norm(
            input, weight, bias, running_mean, running_var,
            training, momentum, eps, False  # cudnn_enabled=False
        )
    torch.nn.functional.batch_norm = batch_norm

This doesn’t seem to help.

train_dataloader.sampler.set_epoch(epoch) doesn’t seem to help either.

EDIT:

What does seem to work is dividing my LR by my world_size, although I’m not sure why.

@YueshangGu Hi, how do you use DataParallel + SyncBN at the same time? I thought SyncBN only works with DistributedDataParallel.

Another issue spotted in my case: the model has to be moved to the proper device before wrapping it in DDP.
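A sketch of the ordering (the tiny model here is a stand-in, rank is this process’s local GPU index, and the process group is assumed to be already initialized):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

rank = 0  # set per process in a real script
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3),
                            torch.nn.BatchNorm2d(8))  # stand-in for the real model

model = model.to(rank)                 # 1) move the model to its GPU first
model = DDP(model, device_ids=[rank])  # 2) then wrap it; DDP broadcasts the
                                       #    initial state from rank 0 here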

Our experience with Kinetics 400 using PyTorch 1.3 on a node with two GPUs is as follows:

Single GPU > DP (-0.2%) > DP w/ sync BN (-0.3%)

Single GPU serves as the baseline for DP and DP w/ sync BN.
The tradeoff with distributed training is understandable, but sync BN causing worse accuracy is hard to ignore.

My setting is the same as yours, just testing on HMDB51, and I also get the following result:
DP > DP w/ sync BN. Have you found a solution to this issue?

Hi guys, where can I find the code for SyncBN?

Maybe the learning rate is the problem?