Average loss in DP and DDP

Hi. I have a question regarding data parallel (DP) and distributed data parallel (DDP).

I have read many articles about DP and understand that the gradients are reduced automatically. However, I could not find an article explaining whether the loss is also reduced. For example, I believe the following code is a typical main routine of a DP program.

outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()

I understand that the inputs are split and the model is replicated on each GPU, a forward pass is computed concurrently to yield the loss, then a backward pass is also computed concurrently, and finally all gradients are reduced to one.

Is the loss obtained by the above code averaged over all the GPUs, i.e., exactly the same as the loss computed by a serial program? Or is the loss a value from just one GPU (gpu0)? I need to plot a loss chart, so I wonder whether the loss is averaged over the GPUs.

The same question applies to outputs. I also need to compute the training accuracy using outputs in the above code. Does it hold the results from all the GPUs? If so, in what tensor structure are they stored?

Regarding DDP, the above code is written in each process running on its respective GPU. In this case, how can I access the values on all the GPUs to plot the averaged loss and total accuracy?

I appreciate any sources of information. Thank you in advance.


No, the loss is not reduced, because there is only one loss tensor with DP. The gradients, however, are accumulated automatically by the autograd engine. Since DP is single-process multi-thread, all threads share the same autograd engine, and hence ops on different threads are added to the same autograd graph.

Is the loss obtained by the above code averaged over all the GPUs, i.e., exactly the same as the loss computed by a serial program? Or is the loss a value from just one GPU (gpu0)? I need to plot a loss chart, so I wonder whether the loss is averaged over the GPUs.

DP’s forward function gathers all outputs to cuda:0 (by default) and then returns the gathered result. So in the code above, outputs is on one GPU, and hence loss is also on one GPU.

The same question applies to outputs. I also need to compute the training accuracy using outputs in the above code. Does it hold the results from all the GPUs? If so, in what tensor structure are they stored?

Below is DP’s forward function. The outputs variable on line 161 holds the outputs on different GPUs, but the gather function on line 162 copies them to one GPU.

If you want to access the individual outputs on the different GPUs, you can do so in the forward function of your model (the one you passed to the DP constructor). E.g.,

import torch.nn as nn
from torch.nn import DataParallel

class MyModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc = nn.Linear(10, 10)

  def forward(self, input):
    output = self.fc(input)
    print("per-GPU output ", output)  # runs once per replica/thread
    return output


dp = DataParallel(MyModel())
outputs = dp(inputs)  # this outputs is gathered onto one GPU

Regarding DDP, the above code is written in each process running on its respective GPU. In this case, how can I access the values on all the GPUs to plot the averaged loss and total accuracy?

You can use gather, all_gather, or all_reduce to communicate the loss to one process and print it.
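For illustration, here is a minimal sketch of the all_reduce option. It assumes dist.init_process_group() has already been called in every DDP process; the helper name global_average is mine, not part of the PyTorch API.

```python
import torch
import torch.distributed as dist

# Hypothetical helper: average a per-process scalar (e.g. loss.item())
# across all processes. Assumes the process group is already initialized.
def global_average(local_value, device="cpu"):
    t = torch.tensor([float(local_value)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # sum across processes
    return (t / dist.get_world_size()).item()  # then divide by world size
```

In the training loop, each process would call something like avg = global_average(loss.item(), device), and rank 0 can log or plot the result.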

BTW, could you please add a “distributed” tag to distributed training related questions? People working on distributed training monitor that tag and can get back to you promptly.


Thank you Shen Li for your detailed explanation. It is very helpful, and now I understand what’s going on in DP and DDP. I modified your code to check the order of the data, so that I can make sure the output is correctly matched to the corresponding labels in the loss function.

import torch
import torch.nn as nn

device = "cuda:0"

class Model(nn.Module):

    def __init__(self):
        super(Model, self).__init__()
    
    # forward() outputs the input as it is. 
    def forward(self, input):
        output = input
        print("per-GPU output ", output)
        return output

model = Model()
model = nn.DataParallel(model)
model.to(device)

# input is a 2D tensor of sequential integers.
input = torch.arange(20 * 5).reshape(20, 5)
input = input.to(device)
print("total input ", input)
output = model(input)
print("total output ", output)

I was not sure about the “tag” that you pointed out, but I added “distributed” to “Categories”.

I still have a related question about DDP.
In my understanding, the gradient is a vector that points in the direction where the loss increases the most. I learned from your explanation that we don’t have the “total” loss until we gather, all_gather, or all_reduce the losses computed on each GPU. If we use the loss in each process instead of the total loss to compute each gradient and then average all the gradients, will the result be the correct “total” gradient of the total loss?

In other words, I wonder whether it is mathematically correct that averaging the gradients, each of which increases its respective local loss, produces a total gradient that increases the averaged loss.

If it is not correct, I think it means we need to all_reduce the loss before calling loss.backward, in order to hand the total loss to each process for computing correct gradients. Is my thinking correct?

Thank you again for your kind assistance.

Good question. Instead of communicating the loss, DDP communicates gradients. So the loss is local to every process, but after the backward pass the gradients are globally averaged, so that all processes see the same gradient. This is a brief explanation; there is a full paper describing the algorithm.

If it is not correct, I think it means we need to all_reduce the loss before calling loss.backward, in order to hand the total loss to each process for computing correct gradients. Is my thinking correct?

The reason we don’t communicate the loss is that it’s not sufficient. When computing gradients, we need both the loss and the activations, and the activations depend on the local inputs. So we would need to communicate either loss + activations or gradients. DDP does the latter.


Thank you again.

Maybe you have fully answered my question, but I still feel that my point is missing. As I understand it, a gradient is computed by back-propagation using the chain rule and the first derivatives of the functions in the model network. Also, as you mentioned, we need the function values within the network, as well as the loss.

Since the method existed long before the parallelism era, back-propagation naturally started from a single “total” or “global” loss on a single-processor platform. In that case, we use a loss that is already averaged over a batch of inputs. On a multi-GPU platform, by contrast, a batch is further divided into smaller batches, each of which is used to produce a “local” loss on a GPU. In that case, when computing the local gradient, the functions, inputs, and function values are exactly the same as in the single-processor case. The only difference is using the local loss instead of the global loss.

My question is: does averaging the local gradients computed from the local losses produce exactly the same result as the global gradient computed from the global loss?

If the answer is no, I think we need to average the local losses to produce a global loss and hand it to all the GPUs to compute correct local gradients, which are then averaged to produce a correct global gradient. This might be achieved by performing all_reduce() over the local losses before calling loss.backward() on each GPU.

The answer could be yes, but I don’t know the mathematical explanation for it.

That is my point.
If I misunderstand something, please point it out. Thank you.


In that case, when computing the local gradient, the functions, inputs, and function values are exactly the same as in the single-processor case.

This is actually not true. Say we have a function f(x) = w * x, where w is the weight. When you compute the gradient (i.e., dw), you need both df (from the loss, which depends on the local input) and x (the local input or an intermediate local output, which also depends on the local input). So, if we don’t communicate gradients, we need to communicate both the final loss and the intermediate outputs of all layers.
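A tiny autograd check of that f(x) = w * x example (the numbers are arbitrary): the weight gradient equals df * x, so it cannot be computed without the local input x.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # the weight
x = torch.tensor(3.0)                      # the "local input"

f = w * x
f.backward()   # for a scalar output, autograd seeds df with 1
print(w.grad)  # tensor(3.) -- i.e. df * x = 1 * 3, determined by the local input
```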

No, this is not guaranteed to be the same, but for a different reason. If 1) the loss function satisfies loss_fn([x1, x2]) == (loss_fn(x1) + loss_fn(x2)) / 2 and 2) the batch sizes on all processes are the same, then averaging the gradients is correct. Otherwise, averaging won’t produce the same result. For example, if we use .sum() as the loss function, we should sum instead of averaging the gradients.
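A CPU-only sketch of the .sum() counter-example (shapes and seed are arbitrary): splitting a batch of 6 into two "per-GPU" halves, the full-batch gradient of a sum loss equals the sum of the per-shard gradients, while their average is off by a factor of 2.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(6, 4)

(x @ w).sum().backward()      # serial: sum loss over the full batch
g_full = w.grad.clone()

w.grad = None
(x[:3] @ w).sum().backward()  # "GPU 0": first half of the batch
g0 = w.grad.clone()

w.grad = None
(x[3:] @ w).sum().backward()  # "GPU 1": second half of the batch
g1 = w.grad.clone()

print(torch.allclose(g_full, g0 + g1))        # True: summing matches
print(torch.allclose(g_full, (g0 + g1) / 2))  # False: averaging does not
```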

If the answer is no, I think we need to average the local losses to produce a global loss and hand it to all the GPUs to compute correct local gradients, which are then averaged to produce a correct global gradient. This might be achieved by performing all_reduce() over the local losses before calling loss.backward() on each GPU.

I might be missing something. If we do the above, it means we compute the gradients using the global loss and the local activations (i.e., global df and local x in the f(x) = w * x example above). In that case, what does this gradient mean?

Thank you for your further explanation.

So, if we don’t communicate gradients, we need to communicate both the final loss and the intermediate outputs of all layers.

Yes, I agree that we must communicate gradients to obtain a global gradient. My question is about the relationship between the global loss and the local gradients, not about communicating losses instead of gradients.

If 1) the loss function satisfies loss_fn([x1, x2]) == (loss_fn(x1) + loss_fn(x2)) / 2 and 2) the batch sizes on all processes are the same, then averaging the gradients is correct.

I understand that, in a parallel process, the losses are locally averaged on each GPU, and the resulting losses can then be globally averaged. That is why the condition you explained must hold for the “average of averages” to equal the global average.

My point is that a parallel process simply does the same thing in parallel as a serial process does, and both are supposed to produce identical results.

What I am wondering about is that the backward pass of the computational graph in a DDP process starts from a local loss, while in a serial process it starts from the global loss, yet they are supposed to produce the same result.

From your earlier explanation, I learned that the backward pass starts from the global loss in DP, but not in DDP. So I believe DP will produce the same results as a serial process, but I wonder about DDP.

One thing I have noticed is that, if the global loss is computed as sum() / batch_size, the backward pass might start from 1 divided by batch_size. If that is true, the only difference between starting from the global loss and the local loss should be dividing by the global batch size versus the local per-GPU batch size.

So I suspect that the gradients in those cases have the same direction but different magnitudes. In particular, the gradient from DDP might be n_gpu times larger than that from DP, where n_gpu is the number of GPUs. Even if this is true, it will not be a big problem, but DDP may require a different learning rate from DP. That is just my thinking, and it needs confirmation.

Is this correct? I appreciate your assistance. Thank you.

Yep, this is true for the sum() / batch_size case you mentioned, on the condition that all processes use the same batch size. Here is the test to verify that:
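The linked test itself is not reproduced here, but a minimal stand-in (my own sketch, with arbitrary shapes) shows the claim: with a mean-style loss and equal shard sizes, averaging the per-shard gradients reproduces the serial full-batch gradient.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)

(x @ w).pow(2).mean().backward()  # serial: mean loss over the full batch of 8
g_serial = w.grad.clone()

grads = []
for shard in x.chunk(2):          # two equal "per-GPU" batches of 4
    w.grad = None
    (shard @ w).pow(2).mean().backward()
    grads.append(w.grad.clone())

g_ddp = (grads[0] + grads[1]) / 2  # DDP-style all-reduce average
print(torch.allclose(g_serial, g_ddp))  # True
```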

In particular, the gradient from DDP might be n_gpu times larger than that from DP, where n_gpu is the number of GPUs. Even if this is true, it will not be a big problem, but DDP may require a different learning rate from DP. That is just my thinking, and it needs confirmation.

DDP computes the average of the gradients from all processes, so for the sum() / batch_size case the gradient should have the same value as in local training. What might affect the learning rate is the batch size you configure for each DDP process. If each process uses the same batch_size as local training, it means that in each iteration the DDP gang collectively processes world_size * batch_size input samples, so you might be more confident in the resulting gradient compared to local training and might want to set a larger learning rate. But this is not guaranteed. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel

Thank you, Shen Li.

DDP computes the average of the gradients from all processes, so for the sum() / batch_size case the gradient should have the same value as in local training.

I interpret this as meaning that the difference is taken care of when computing the global gradient from the local gradients, so we will see no difference from the serial case.

What might affect the learning rate is the batch size you configure for each DDP process.

I think that whether or not we expand the global batch size is a choice between computation speed per iteration and the algorithmic efficiency of overall convergence, together with the larger learning rate you mentioned. Besides, we can make better use of GPU memory if we choose a large batch size. I feel that a larger batch brings faster convergence even in wall-clock time, if we can efficiently utilize multiple GPUs. That’s what I’m trying to do.

Thank you very much. I appreciate your time for this long discussion.


Hi, Shen Li. If I use loss = loss.sum(), how can I avoid averaging the gradients in DDP?