Batch size with DDP vs. without DDP

In DistributedDataParallel, when the local batch size is 64 (i.e. torch.utils.data.DataLoader(batch_size=64) together with torch.utils.data.distributed.DistributedSampler()), assume there are N processes in total in DDP (distributed across one node or several nodes). Is the forward-backward process in DDP similar to the forward-backward process on a single GPU using 64×N batch-size inputs?
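Concretely, the setup in question looks roughly like the following sketch (the dataset, backend, and launcher details are placeholders, not part of the original question):

```python
# Sketch of the per-process setup described above. The dataset and backend are
# placeholders; rank/world_size come from however the N processes are launched.
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="gloo")  # "nccl" when training on GPUs
rank = dist.get_rank()
world_size = dist.get_world_size()       # N in the question

dataset = TensorDataset(torch.randn(10_000, 32))  # placeholder dataset

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Each of the N processes draws local batches of 64, so a single optimizer
# step consumes 64 * N samples in total across the job.
```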

Yes, distributed training using DDP is mathematically equivalent to local training.

Can you clarify this? The OP is asking whether a batch_size of 64 per DDP process with a world size of N is the same as a single GPU with a total batch size of 64*N. There is a note in the DDP docs which states:

“When a model is trained on M nodes with batch=N , the gradient will be M times smaller when compared to the same model trained on a single node with batch=M*N if the loss is summed (NOT averaged as usual) across instances in a batch (because the gradients between different nodes are averaged). You should take this into consideration when you want to obtain a mathematically equivalent training process compared to the local training counterpart. But in most cases, you can just treat a DistributedDataParallel wrapped model, a DataParallel wrapped model and an ordinary model on a single GPU as the same (E.g. using the same learning rate for equivalent batch size).”

It looks to me like they are saying the two should be the same as the single-GPU case with a batch size of 64, and not 64*N (as the OP asked). Can you clarify?
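To try to make the note concrete, here is a rough sketch that simulates the gradient averaging DDP performs across ranks, with made-up data and no actual process group (the sizes and model are purely illustrative):

```python
import torch

torch.manual_seed(0)
world_size = 4   # hypothetical N processes
local_bs = 64    # per-process batch size, as in the question

w = torch.randn(10, requires_grad=True)
x = torch.randn(world_size * local_bs, 10)
y = torch.randn(world_size * local_bs)

# Single-GPU reference: mean loss over the whole 64 * N batch.
loss_global = ((x @ w - y) ** 2).mean()
(grad_global,) = torch.autograd.grad(loss_global, w)

# Simulated DDP: each rank computes a mean loss over its own 64-sample shard,
# then the per-rank gradients are averaged (what DDP's all-reduce does).
ddp_grads = []
for rank in range(world_size):
    shard = slice(rank * local_bs, (rank + 1) * local_bs)
    loss_local = ((x[shard] @ w - y[shard]) ** 2).mean()
    (g,) = torch.autograd.grad(loss_local, w)
    ddp_grads.append(g)
grad_ddp = torch.stack(ddp_grads).mean(dim=0)

# Prints True: with a mean loss, DDP's averaged gradient matches the
# single-GPU gradient on the full 64 * N batch.
print(torch.allclose(grad_global, grad_ddp, atol=1e-6))
# With a *summed* loss instead, grad_ddp would be world_size times smaller
# than the single-GPU sum over 64 * N samples, as the docs note describes.
```

If that reading is right, the usual mean-reduction case lines up with the OP's 64*N interpretation, but I would still appreciate confirmation.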

Thanks for any help!