How are averaged gradients computed when DDP model replicas run at different speeds?

LW-Ricarido · November 18, 2020, 5:27am

Hi, I have a question about how DDP averages gradients. The DDP documentation on the backward pass says that when all buckets are ready, the local Reducer blocks until all allreduce operations finish. But what happens when GPUs run at different speeds? For example, if two GPUs (e.g. cuda:0 and cuda:1) run 1.5x faster than the others (with the same code and processing), cuda:0 and cuda:1 will produce more gradients. Will they keep these gradients in the bucket and wait for the other GPUs to be ready, or will they discard them and reduce only the gradients that are ready on all GPUs?

Thanks a lot.

seungjun · November 18, 2020, 9:29am

The faster GPU processes will wait for other GPUs to finish their backward computation.
By default, DDP synchronizes gradients and parameters and then performs the next forward computation.
The differing speed case is quite common in practice.
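
To make the waiting behaviour concrete, here is a toy simulation in plain Python. It is only a sketch and not how DDP is implemented: threads stand in for ranks, a threading.Barrier stands in for the blocking allreduce, and the timings and gradient values are made up for illustration.

```python
import threading
import time

WORLD_SIZE = 4
# Hypothetical backward-pass durations in seconds: ranks 0 and 1 are faster.
BACKWARD_SECONDS = [0.02, 0.02, 0.03, 0.03]

local_grads = [None] * WORLD_SIZE  # one gradient value per rank for this step
averaged = [None] * WORLD_SIZE     # what each rank holds after the "allreduce"
waited = [0.0] * WORLD_SIZE        # time each rank spent blocked
barrier = threading.Barrier(WORLD_SIZE)


def worker(rank):
    time.sleep(BACKWARD_SECONDS[rank])   # simulated backward computation
    local_grads[rank] = float(rank + 1)  # made-up gradient for this rank
    start = time.perf_counter()
    barrier.wait()                       # stand-in for the blocking allreduce
    waited[rank] = time.perf_counter() - start
    # Every rank contributed exactly one gradient, so all ranks compute
    # the same mean over the full set.
    averaged[rank] = sum(local_grads) / WORLD_SIZE


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("averaged gradient per rank:", averaged)
print("seconds blocked per rank:", [round(w, 3) for w in waited])
```

In this toy model the faster ranks do not produce extra gradients. They reach the synchronization point earlier and spend the difference blocked, and every rank ends the step with the same average.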

You can take a look at the forward function of DDP.

https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L675

# Calling _rebuild_buckets before forward computation,
# It may allocate new buckets before deallocating old buckets
# inside _rebuild_buckets. To save peak memory usage,
# call _rebuild_buckets before the peak memory usage increases
# during forward computation.
# This should be called only once during whole training period.
if self.reducer._rebuild_buckets():
    logging.info("Reducer buckets have been rebuilt in this iteration.")
if self.require_forward_param_sync:
    self._sync_params()
if self.ddp_uneven_inputs_config.ddp_join_enabled:
    # Notify joined ranks whether they should sync in backwards pass or not.
    self._check_global_requires_backward_grad_sync(is_joined_rank=False)
if self.device_ids:
    if len(self.device_ids) == 1:
        inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
        output = self.module(*inputs[0], **kwargs[0])
    else:

osalpekar · November 25, 2020, 12:24am

Just to add some more insight to this, we have a bucket_cap_mb argument in the DDP constructor. This defines the size of a gradient bucket in megabytes (the default is 25). During the backward pass, each rank fills a bucket with gradients and then kicks off the allreduce collective for that bucket. Faster ranks kick off the allreduce earlier than slower ranks, so they simply block until the slower ranks kick off the same collective. No gradients are abandoned in this process. You can tune bucket_cap_mb as desired: smaller buckets trigger allreduce more frequently, and larger buckets trigger it less frequently.
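
As a rough illustration of where bucket_cap_mb goes, here is a minimal sketch. It assumes a single CPU process with the gloo backend so it can run without GPUs or a launcher. The model, the port number and bucket_cap_mb=1 are illustrative choices, not recommendations.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration only; real jobs get rank and
# world size from a launcher. The address and port are arbitrary choices.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Toy model; the first Linear layer alone holds about 4 MB of float32 weights.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# A small cap means more, smaller buckets, so allreduce is triggered
# more often during backward.
ddp_model = DDP(model, bucket_cap_mb=1)

loss = ddp_model(torch.randn(8, 1024)).sum()
loss.backward()  # gradients are reduced bucket by bucket as they become ready

grads_ready = all(p.grad is not None for p in ddp_model.parameters())
print("all gradients populated:", grads_ready)

dist.destroy_process_group()
```

With more than one rank, each bucket's allreduce completes only after every rank has launched it, which is where a faster rank ends up waiting.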

If the performance difference is too great, you can explore syncing gradients less frequently (every n batches instead of every batch) using the model.no_sync() context manager, or using multiple process groups (via the new_group API). If DDP training gets stuck because these blocked collectives hang for too long, you may look into using torchelastic together with a mechanism to time out hanging collectives (such as NCCL_ASYNC_ERROR_HANDLING or NCCL_BLOCKING_WAIT; see the torch.distributed docs).
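
Syncing every n batches with no_sync() could look roughly like the sketch below. It again assumes a single CPU process with the gloo backend so that it runs standalone. SYNC_EVERY, the toy model, the random batches and the port number are hypothetical choices for illustration.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration only; address and port are arbitrary.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

ddp_model = DDP(nn.Linear(16, 4))
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

SYNC_EVERY = 4  # hypothetical: allreduce gradients once every 4 batches
batches = [torch.randn(8, 16) for _ in range(8)]  # stand-in for a data loader
optimizer_steps = 0

for step, batch in enumerate(batches, start=1):
    if step % SYNC_EVERY != 0:
        # Forward and backward both run inside no_sync(), so gradients
        # accumulate locally and no allreduce is launched.
        with ddp_model.no_sync():
            ddp_model(batch).sum().backward()
    else:
        # This backward runs outside no_sync(), so the accumulated
        # gradients are allreduced before the optimizer step.
        ddp_model(batch).sum().backward()
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print("optimizer steps taken:", optimizer_steps)
dist.destroy_process_group()
```

Here ranks only have to meet once every SYNC_EVERY batches instead of after every batch. The gradients are summed over the accumulated batches, so you may want to scale the loss or the learning rate accordingly.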