Why does torch.nn.parallel.DistributedDataParallel run faster than torch.nn.DataParallel on a single machine with multiple GPUs?

As mentioned in the docs:
"This is the highly recommended way to use DistributedDataParallel, with multiple processes, each of which operates on a single GPU. This is currently the fastest approach to do data parallel training using PyTorch and applies to both single-node (multi-GPU) and multi-node data parallel training. It is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training."

In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data-parallelism, including torch.nn.DataParallel():

  • Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
  • Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
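
To check that I follow the first bullet, here is a minimal sketch in plain Python. It is only a simulation, not PyTorch code and not how DistributedDataParallel is implemented: the "replicas" are list entries, and the helper names (shard_gradient, all_reduce_mean, train_ddp_style, train_single) are hypothetical, made up for illustration. It assumes a one-parameter least-squares model and a batch that divides evenly among the replicas. The point is that if every replica starts from the same weight and applies the same averaged gradient with its own optimizer step, the weights stay identical without any parameter broadcast.

```python
# Simulation only: list entries stand in for per-GPU processes.

def shard_gradient(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)**2 over one shard of the batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train_ddp_style(xs, ys, k, lr=0.1, steps=20):
    """k replicas, each holding its own weight and doing its own SGD step."""
    per = len(xs) // k  # assumes the batch divides evenly by k
    weights = [0.0] * k  # every replica starts from the same weight
    for _ in range(steps):
        # Each replica computes a gradient on its own batch_size/k samples.
        local = [
            shard_gradient(weights[r],
                           xs[r * per:(r + 1) * per],
                           ys[r * per:(r + 1) * per])
            for r in range(k)
        ]
        # Gradients are averaged across replicas on every step.
        synced = all_reduce_mean(local)
        # Each replica applies the same update independently: no broadcast.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

def train_single(xs, ys, lr=0.1, steps=20):
    """Reference: one process training on the full batch."""
    w = 0.0
    for _ in range(steps):
        w -= lr * shard_gradient(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
replica_weights = train_ddp_style(xs, ys, k=2)
single_weight = train_single(xs, ys)
print(replica_weights, single_weight)
```

With equal shard sizes, the average of the per-shard mean gradients equals the full-batch mean gradient, so the replicas should also track the single-process run up to floating-point rounding.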

Here is my understanding:
DistributedDataParallel creates one process for every GPU. For every batch, each process works on batch_size/k samples (k is the number of GPUs), and they are not gathered until all batches have been trained.
Is that right?
And correspondingly, how does torch.nn.DataParallel work differently?