Distributed Data Parallel no_sync

Are forward and backward still synchronization points in DDP even if they are inside a no_sync() context? My understanding is that no_sync prevents gradient averaging, but I was wondering if it disables syncing completely.


Hey @lkp411

no_sync() disables the gradient allreduce hook that DDP registers on backward, which should completely disable communication across processes (unless you are using DDP.join()).
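The usual reason to use no_sync() is gradient accumulation: skip the allreduce for the first few micro-batches and only synchronize on the last one. A minimal runnable sketch (single process, world size 1, gloo backend purely for illustration; the address/port values are assumptions):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# A "group of one" so the example runs standalone; real use spawns one
# process per rank. MASTER_ADDR/MASTER_PORT values here are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [torch.randn(8, 4) for _ in range(3)]

# Inside no_sync(), backward skips the allreduce hook, so .grad
# accumulates purely locally on each rank.
with model.no_sync():
    for x in micro_batches[:-1]:
        model(x).sum().backward()

# The last backward runs outside no_sync(), so the accumulated
# gradients are averaged across ranks here, once.
model(micro_batches[-1]).sum().backward()
opt.step()
dist.destroy_process_group()
```

With N accumulation steps this issues one allreduce instead of N, which is the point of the context manager.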

Hey @mrshenli

Thanks so much for your reply. What are the other sync points in DDP? I remember reading somewhere here that copying data from the GPU to the CPU forces a sync. Is this true? And does it still happen in a no_sync context?

Once again, thanks for your time.

Yes, this is true: the copy needs the computed tensor before it can move it to host memory, so the CPU has to synchronize with the GPU workload that produces it.
This is also the case in a DDP setup with no_sync().
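A short sketch of that intra-process sync (the tensor sizes are arbitrary; on a CPU-only machine it degenerates to a plain host copy):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
y = x @ x    # on CUDA this is queued asynchronously on the current stream
z = y.cpu()  # .cpu() blocks the host until y has actually been computed
# torch.cuda.synchronize() would have the same host-blocking effect,
# but for the whole device rather than just this one result.
```

This block happens regardless of no_sync(), because it is a CUDA-stream sync inside one process, not a collective across processes.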

Hey @lkp411, in the context of DDP, there are two different types of synchronizations: intra-process (CUDA stream) and inter-process (collective comm). The gradient averaging (AllReduce) is an inter-process sync, and the CPU-to-GPU copy is an intra-process sync. Which ones are you referring to in the following question?

What are the other sync points in DDP?

I was referring to inter-process syncs. Are there any other sync points besides the DDP constructor, forward, backward, and collective communication calls? And does calling torch.cuda.synchronize(rank) in a specific process after issuing an async collective communication call (for example an all_reduce) block until the result of the collective communication call is available?
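For reference, an async collective returns a Work handle whose wait() blocks until the collective completes on that rank. A runnable single-process sketch with the gloo backend (address/port values are assumptions; with one rank the allreduce is a no-op sum):

```python
import os
import torch
import torch.distributed as dist

# Standalone "group of one"; MASTER_ADDR/MASTER_PORT are placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)  # returns immediately with a handle
work.wait()  # blocks until the allreduce has completed on this rank
# Note: with the NCCL backend, wait() only blocks the current CUDA
# stream, not the CPU; torch.cuda.synchronize() is what blocks the host
# until all queued device work, including the collective, has finished.
dist.destroy_process_group()
```

So torch.cuda.synchronize() after an async NCCL collective does make the host wait for its result, but work.wait() (or consuming the tensor on the same stream) is the more targeted way to order work on it.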

Also, since we’re on the topic, are there plans to add sparse all_reduce capabilities to the NCCL backend?