Distributed Data Parallel no_sync

Are forward and backward still synchronization points in DDP even if they are inside a no_sync() context? My understanding is that no_sync prevents gradient averaging, but I was wondering if it disables syncing completely.


Hey @lkp411

no_sync() disables the gradient allreduce hook that DDP registers on backward, which should completely disable communication across processes (unless you are using DDP.join()).
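The usual reason to use no_sync() is gradient accumulation: skip the allreduce for the first few micro-batches and only synchronize on the last one. A minimal runnable sketch (single process, world size 1, gloo backend purely for illustration; the address/port values are assumptions):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# A "group of one" so the example runs standalone; real use spawns one
# process per rank. MASTER_ADDR/MASTER_PORT values here are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [torch.randn(8, 4) for _ in range(3)]

# Inside no_sync(), backward skips the allreduce hook, so .grad
# accumulates purely locally on each rank.
with model.no_sync():
    for x in micro_batches[:-1]:
        model(x).sum().backward()

# The last backward runs outside no_sync(), so the accumulated
# gradients are averaged across ranks here, once.
model(micro_batches[-1]).sum().backward()
opt.step()
dist.destroy_process_group()
```

With N accumulation steps this issues one allreduce instead of N, which is the point of the context manager.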

Hey @mrshenli

Thanks so much for your reply. What are the other sync points in DDP? I remember reading somewhere here that copying data from the GPU to the CPU forces a sync. Is this true? And does it still happen in a no_sync context?

Once again, thanks for your time.

Yes, this is true: the copy needs the computed tensor before it can move it to host memory, so the CPU has to synchronize with the GPU workload that produces it.
This is also the case in a DDP setup with no_sync().
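A short sketch of that intra-process sync (the tensor sizes are arbitrary; on a CPU-only machine it degenerates to a plain host copy):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
y = x @ x    # on CUDA this is queued asynchronously on the current stream
z = y.cpu()  # .cpu() blocks the host until y has actually been computed
# torch.cuda.synchronize() would have the same host-blocking effect,
# but for the whole device rather than just this one result.
```

This block happens regardless of no_sync(), because it is a CUDA-stream sync inside one process, not a collective across processes.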

Hey @lkp411, in the context of DDP, there are two different types of synchronizations: intra-process (CUDA stream) and inter-process (collective comm). The gradient averaging (AllReduce) is an inter-process sync, and the CPU-to-GPU copy is an intra-process sync. Which ones are you referring to in the following question?

What are the other sync points in DDP?

I was referring to inter-process syncs. Are there any other sync points besides the DDP constructor, forward, backward, and collective communication calls? And does calling torch.cuda.synchronize(rank) in a specific process after issuing an async collective communication call (for example an all_reduce) block until the result of the collective communication call is available?
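For reference, an async collective returns a Work handle whose wait() blocks until the collective completes on that rank. A runnable single-process sketch with the gloo backend (address/port values are assumptions; with one rank the allreduce is a no-op sum):

```python
import os
import torch
import torch.distributed as dist

# Standalone "group of one"; MASTER_ADDR/MASTER_PORT are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns immediately with a handle
work.wait()  # blocks until the allreduce has completed on this rank
# Note: with the NCCL backend, wait() only blocks the current CUDA
# stream, not the CPU; torch.cuda.synchronize() is what blocks the host
# until all queued device work, including the collective, has finished.
dist.destroy_process_group()
```

So torch.cuda.synchronize() after an async NCCL collective does make the host wait for its result, but work.wait() (or consuming the tensor on the same stream) is the more targeted way to order work on it.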

Also, since we’re on the topic, are there plans to add sparse all_reduce capabilities to the NCCL backend?