DistributedDataParallel taking twice as long as DataParallel

I used the DataParallel implementation from
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

and the DistributedDataParallel implementation from Distributed Data Parallel — PyTorch master documentation.

Following the DataParallel tutorial (Optional: Data Parallelism — PyTorch Tutorials 1.12.0+cu102 documentation), I printed the input and output sizes for both DataParallel and DistributedDataParallel. The results are shown below; the batch size is 4 and the images have 1 channel and a size of 512x512.

------------------ DataParallel ------------------

In Model: input size torch.Size([2, 1, 512, 512])
In Model: input size torch.Size([2, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([2, 1, 512, 512])
In Model: input size torch.Size([2, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])

------------------ Distributed Data Parallel ------------------
In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([4, 1, 512, 512])
Outside: input size torch.Size([4, 1, 512, 512]) output_size torch.Size([4, 1, 512, 512])

In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])

In Model: input size torch.Size([1, 1, 512, 512])
Outside: input size torch.Size([1, 1, 512, 512]) output_size torch.Size([1, 1, 512, 512])

As seen above, DataParallel puts a batch of 2 on each of the GPUs, whereas DistributedDataParallel puts the full batch of 4 on each of the GPUs.

So DistributedDataParallel is taking exactly twice as long as DataParallel. Do I need to implement all the functions like Partition, DataPartitioner, partition_dataset, and average_gradients shown in the linked example (dist_tuto.pth/train_dist.py at gh-pages · seba-1511/dist_tuto.pth · GitHub) to make DistributedDataParallel work?

Or is following only what is specified in this tutorial (Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.0+cu102 documentation) not enough?
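
For context, my setup roughly follows that tutorial. A minimal sketch of that kind of setup is below (not my exact code: the Conv2d just stands in for my real model, and run_worker is a placeholder name):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Stand-in for my real network: maps 1-channel 512x512 images to the same shape.
    model = nn.Conv2d(1, 1, kernel_size=3, padding=1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])  # gradients are averaged across ranks automatically

    # ... training loop using ddp_model goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)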

Your use cases are not equal, as you are doubling the batch size in the DDP run: each rank processes the full batch of 4 instead of a shard of 2.
Note that data loading in DDP is done on each rank separately, so you would usually use a DistributedSampler to avoid drawing duplicated samples and keep the per-rank batch size at batch_size / world_size; see the sketch below.
Could you explain where you are currently stuck?
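
A minimal sketch of what I mean, assuming the process group is already initialized as in your worker (the dataset here is just random tensors standing in for your real data):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Stand-in dataset: 64 single-channel 512x512 images, as in your logs.
dataset = TensorDataset(torch.randn(64, 1, 512, 512))

global_batch_size = 4
world_size = dist.get_world_size()                      # assumes init_process_group was called
per_rank_batch_size = global_batch_size // world_size   # 2 per GPU, matching DataParallel

sampler = DistributedSampler(dataset)                   # each rank draws a disjoint shard
loader = DataLoader(dataset, batch_size=per_rank_batch_size, sampler=sampler)

num_epochs = 10
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                            # reshuffle differently each epoch
    for (images,) in loader:
        images = images.to(f"cuda:{dist.get_rank()}")
        # ... forward/backward through the DDP-wrapped model as usual ...

With this, each DDP step processes the same global batch of 4 as DataParallel (2 per GPU), and each rank sees a different part of the dataset, so the per-step time and the number of steps per epoch become comparable.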